Abstract

The notion of typical sequences plays a key role in the theory of information. Central to the idea of
typicality is that a sequence $x_1, x_2, \ldots, x_n$ that is $P_X$-typical should, loosely speaking, have an
empirical distribution that is in some sense close to the distribution $P_X$. The two most common notions
of typicality are that of strong (letter) typicality and weak (entropy) typicality. While weak typicality
allows one to apply many arguments that can be made with strong typicality, some arguments for strong
typicality cannot be generalized to weak typicality.

In this paper, we consider an alternate definition of typicality, namely one based on the weak*
topology and that is applicable to Polish alphabets (which includes $\mathbb{R}^n$). This notion is a
generalization of strong typicality in the sense that it degenerates to strong typicality in the finite
alphabet case, and can also be applied to mixed and continuous distributions. Furthermore, it is strong
enough to prove a Markov lemma, and thus can be used to directly prove a more general class of results
than weak typicality. As an example of this technique, we directly prove achievability for Gel'fand-Pinsker
channels with input constraints for a large class of alphabets and channels without first proving a finite
alphabet result and then resorting to delicate quantization arguments.

While this large class does not include Gaussian distributions with power constraints, it is shown to
be straightforward to recover this case by considering a sequence of truncated Gaussian distributions.

Index Terms

Typical sequences, weak* topology, capacity, Gel'fand-Pinsker



I. Introduction 

Perhaps the most intuitive method of deriving achievable rates in information theory for stationary
memoryless problems is with the concept of typical sequences. Roughly speaking, a sequence is typical
if its empirical distribution is close, in some sense, to some ideal distribution.

The author is with the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, Ontario N2L 
3G1 (email: pmitran@ecemail.uwaterloo.ca). 
Ignoring minor variations in definitions, there are essentially two broad notions of typical sequences.
The first notion, called weakly typical sequences, measures the closeness of a sequence
$\mathbf{x} = (x_1, \ldots, x_n)$ to a distribution $P_X$ by quantifying the probability of the sequence
$\mathbf{x}$. Specifically, the length $n$ sequence $\mathbf{x}$ is $(P_X, \epsilon)$-weakly typical if

$$\left| -\frac{1}{n} \sum_{\ell=1}^{n} \log P_X(x_\ell) - H(X) \right| < \epsilon, \qquad (1)$$

where $P_X$ is a probability mass function (pmf) and $H(X)$ the entropy of $X$ if $X$ is discrete while $P_X$ is
a probability density function (pdf) and $H(X)$ the differential entropy of $X$ if $X$ is continuous. For this
reason, weakly typical sequences are often referred to as entropy typical. The notion of weak typicality
does not appear to generalize well to mixed distributions.

By contrast, strong typicality characterizes a sequence $\mathbf{x}$ by the relative frequency of the occurrence
of each letter of the alphabet of $X$. Specifically, if $N(a|\mathbf{x})$ denotes the number of occurrences of the
letter $a$ in the length $n$ sequence $\mathbf{x}$, then $\mathbf{x}$ is $(P_X, \epsilon)$-strongly typical if

$$\left| N(a|\mathbf{x})/n - P_X(a) \right| < \epsilon, \qquad (2)$$

for all letters $a$.^1 Strong typicality is sometimes referred to by the more descriptive name of letter typicality.
Evidently, strong typicality implies at most a countable alphabet.
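
To make the difference between (1) and (2) concrete, the following is a minimal sketch for a finite alphabet. The function names, the use of base-2 logarithms and the dictionary representation of the pmf are illustrative assumptions and do not come from the paper.

```python
import math
from collections import Counter

def is_weakly_typical(x, pmf, eps):
    """Check (1): | -(1/n) sum_l log2 P_X(x_l) - H(X) | < eps."""
    n = len(x)
    entropy = -sum(p * math.log2(p) for p in pmf.values() if p > 0)
    if any(pmf.get(a, 0.0) == 0.0 for a in x):
        return False  # the sequence has probability zero under P_X
    sample_entropy = -sum(math.log2(pmf[a]) for a in x) / n
    return abs(sample_entropy - entropy) < eps

def is_strongly_typical(x, pmf, eps):
    """Check (2): | N(a|x)/n - P_X(a) | < eps for every letter a."""
    n = len(x)
    counts = Counter(x)
    letters = set(pmf) | set(counts)
    return all(abs(counts[a] / n - pmf.get(a, 0.0)) < eps for a in letters)

# Hypothetical example: a uniform binary source.
uniform = {0: 0.5, 1: 0.5}
all_zeros = [0] * 100
balanced = [0, 1] * 50
```

Under the uniform pmf every binary sequence has the same probability, so `all_zeros` passes the entropy test even though its letter frequencies are far from $P_X$, whereas `balanced` passes both tests.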

Strong typicality has at least two key consistency properties not shared with weak typicality. First,
strong typicality is sufficient for proving a Markov lemma, which is a key technique in many network
information theory proofs. The Markov lemma is essentially a corollary of the following broad statement,
which one would intuitively expect to be true for any reasonable definition of typicality: if a typical
sequence, generated in some arbitrary fashion, is input to a stationary memoryless channel, then the
input and output sequences should be jointly typical in some sense. Unfortunately, this statement does
not hold in general with weak typicality. We call this desirable property the channel consistency property.

A second desirable property of a typical sequence involves cost functions. Specifically, let
$g : \mathcal{X} \to \mathbb{R}$ be a mapping from the alphabet of $X$ to the reals. If the length $n$ sequence
$\mathbf{x}$ is $P_X$-typical, one would expect that the weighted sum $\frac{1}{n} \sum_{\ell} g(x_\ell)$ would be
reasonably approximated by $E_{P_X}[g(X)]$. While such a statement can be formalized for strong typicality,
entropy typicality by itself is not sufficient to imply this property.^2 We call this property the cost
consistency property.

^1 In some variations, the additional condition that $N(a|\mathbf{x}) = 0$ if $P_X(a) = 0$ is imposed.

^2 It should be noted that for continuous/discrete distributions, a variation of so-called distortion typical sequences can resolve
this issue for a specific cost function $g(x)$.
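
As a small illustration of why entropy typicality alone cannot control costs, consider a uniform binary source, for which every sequence is entropy typical. The sketch below uses hypothetical helper names and the cost $g(a) = a$, chosen purely for illustration.

```python
def empirical_cost(x, g):
    """Average cost (1/n) * sum of g(x_l) over the sequence."""
    return sum(g(a) for a in x) / len(x)

def expected_cost(pmf, g):
    """Expected cost E_{P_X}[g(X)] for a finite alphabet."""
    return sum(p * g(a) for a, p in pmf.items())

def g(a):
    # Illustrative cost function on the alphabet {0, 1}.
    return float(a)

uniform = {0: 0.5, 1: 0.5}
all_zeros = [0] * 100      # entropy typical for the uniform pmf, not letter typical
balanced = [0, 1] * 50     # both entropy typical and letter typical

gap_all_zeros = abs(empirical_cost(all_zeros, g) - expected_cost(uniform, g))
gap_balanced = abs(empirical_cost(balanced, g) - expected_cost(uniform, g))
```

The all-zeros sequence misses the expected cost by the largest possible margin, while the letter-typical sequence matches it.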
Finally, typical sequences should satisfy certain large deviations results. For example, if $P_{XY}$ is a joint
distribution with marginals $P_X$ and $P_Y$, $\mathbf{y}$ is $P_Y$-typical and $\mathbf{X}$ is generated independently of
$\mathbf{y}$ with each letter i.i.d. according to $P_X$, then the probability of $\mathbf{X}$ and $\mathbf{y}$ being jointly
$P_{XY}$-typical should be on the order of $\approx 2^{-nI(X;Y)}$. This result is usually shown for strong typical
sequences.

The contribution of this paper is to introduce a notion of typicality based on any metric that induces
the weak* convergence of probability measures. We call this notion weak* typicality. The notion is
sufficiently general to apply to any distribution (discrete, continuous or mixed) where the alphabet is
a Polish space, and reduces to strong typicality for finite alphabets. This includes, for example, mixed
distributions in $\mathbb{R}^n$.

We show that this notion of typicality allows for both of the above consistency properties in addition 
to the usual rules that one expects a typical sequence to follow, e.g., if a pair of sequences (x, y) are 
jointly typical, one expects that each of x and y are typical. As an example of this weak* typicality, we 
directly prove an achievability result for Gel'fand-Pinsker channels with alphabets in Polish spaces and 
cost constraint at the transmitter without having to resort to delicate quantization arguments which are 
typically handwaved. Indeed, a key contribution of this work is that by employing the notion of weak* 
typicality, one does not need to directly invoke quantization arguments. 

Two important remarks are in order. First, while the notion of weak* typicality avoids the technical
difficulties of first proving a result in the discrete case and then employing delicate quantization
arguments, some of the large deviation results for weak* typical sequences are proved by using
quantization arguments. However, these arguments are not necessary for the application of weak* typical
sequences.

Second, for the cost consistency property to apply, the cost function must be bounded (and continuous).
This initially precludes weak* typical sequences from being directly applicable to Gaussian input
distributions with power constraints. However, the result in the Gaussian case can be recovered by
considering a sequence of truncated Gaussians.

The techniques developed here do not result in more general expressions for channel capacity. The most
general expression for channel capacity is given by information spectrum methods [11], [19] which apply
not only to channels with memory but even non-ergodic channels. The information spectrum approach
though, looks at quantities such as

$$\underline{I}(\mathbf{X}; \mathbf{Y}) := \text{p-}\liminf_{n \to \infty} \frac{1}{n} \log \frac{P_{Y^n|X^n}(Y^n|X^n)}{P_{Y^n}(Y^n)}. \qquad (3)$$

This characterization is based on ratios of probabilities and, similar to weak typicality, does not appear
to allow for a Markov lemma. Nevertheless, the results presented here allow one to generalize many results
derived using strong typical sequences to a larger class of alphabets in a straightforward manner.

Weak* convergence of probability measures has found some uses in the information theory literature.
Perhaps the most relevant application to this work was by Csiszár to compute the capacity of arbitrarily
varying channels with general alphabets and states [5]. In that work, channels must satisfy a weak*
continuity property and we adopt the same requirement for channels here. In [5], the capacity is first
computed for the special case of finite input alphabets and this result is then used to derive the general
case. It should be noted that while weak* convergence of measures plays a key role in [5], no new notion
of typical sequences is introduced there.

A key difficulty with continuous alphabets is the analytical characterization of mutual information. In 
[17] it was shown that channel capacity is a lower semi-continuous function in the weak* topology. In 
[18], sufficient conditions are found which ensure that channel capacity can be approached by discrete 
input distributions or uniform input distribution with finite support for general alphabet channels. In 
[8], necessary and sufficient conditions for weak* continuity and strict concavity of mutual information 
were found in addition to conditions that characterize the capacity value and capacity achieving measure 
for channels with side-information at the receiver. In [12] and [13], Kieffer proves coding theorems
for stationary (but not necessarily memoryless) channels that are weak* continuous. While a notion
of typicality appears in [12] and [13], it is a variant of strong typicality and defined only on discrete
alphabets.

Weak* convergence has also found applications in analyzing the stability of recursive algorithms such 
as variants of the LMS adaptive filtering algorithm [2]. 

The rest of this paper is structured as follows. In Section II we define channel inputs, input constraints,
channels and information measures such as divergence, as well as some measure theoretic considerations.
In Section IV we define weak* typicality and prove our key results. In Section V we provide two
examples of weak* typical sequences by proving achievability results for point-to-point and Gel'fand-Pinsker
channels. In Section VI we discuss the Gaussian case. In Section VII we conclude this work.
The Appendix contains two technical proofs for weak* typical sequences.

II. Preliminaries 

A. Alphabets and weak* convergence 

In this paper all measurable spaces are Polish (i.e., complete separable metric spaces) and endowed
with the Borel $\sigma$-field generated by the open sets. We denote a measurable space by $E$ and the $\sigma$-field
by $\mathcal{E}$. $\mathcal{M}_1(E)$ denotes the set of probability measures on $E$. When we need to distinguish two spaces,
we employ subscripts such as $(E_X, \mathcal{E}_X)$ and $(E_Y, \mathcal{E}_Y)$. Unless clear from context, a random variable $X$
will take values in the alphabet $E_X$ and have distribution $\mathcal{L}(X)$ which is usually denoted by $P_X$.

$P(A)$ and $E[X]$ denote the probability of event $A$ and mean of random variable $X$ where the underlying
measure is always clear from context, or explicitly stated by subscript.

If $E_X$ and $E_Y$ are Polish, then so is $E_X \times E_Y$. The corresponding $\sigma$-field
$\mathcal{E}_{XY} := \mathcal{E}_X \otimes \mathcal{E}_Y$ is the smallest $\sigma$-field containing all rectangles $A \times B$,
$A \in \mathcal{E}_X$, $B \in \mathcal{E}_Y$.

A sequence $(x_1, \ldots, x_n) \in E_X^n$ will be denoted by $x^n$. When the length is clear from context or
not relevant, we simply write $\mathbf{x}$. An i.i.d. random sequence $\mathbf{X} = (X_1, \ldots, X_n)$ consists of a sequence
of independent random variables $X_1, X_2, \ldots, X_n$, each taking values in $E_X$ and for which the laws
$\mathcal{L}(X_i) = P_X$ are identical.

In this paper, the notion of weak* convergence of probability measures plays a key role. We denote
the weak* convergence of a sequence of measures $P_n$ to a limiting measure $P$ by $P = \text{w-}\lim_{n \to \infty} P_n$.
The Portmanteau theorem [15, Theorem 13.16] provides the following equivalent conditions which will
be used in the sequel.

Theorem 1: Let $E$ be a Polish space and $P, P_1, P_2, \ldots \in \mathcal{M}_1(E)$. Then the following are equivalent:

1) $P = \text{w-}\lim_{n \to \infty} P_n$.

2) $\lim_{n \to \infty} \int f \, dP_n = \int f \, dP$ for all bounded continuous $f$.

3) $\lim_{n \to \infty} P_n(A) = P(A)$ for all $A \in \mathcal{E}$ with $P(\partial A) = 0$, where $\partial A$ denotes the boundary of $A$.

We note that condition 2 is usually taken to be the definition of weak* convergence.
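
The role of the boundary condition in item 3 can be seen in a simple numerical sketch. As an illustrative assumption (not taken from the paper), let $P_n$ be the point mass at $1/n$ and $P$ the point mass at $0$ on the real line, so that integrals reduce to function evaluations.

```python
def f(x):
    # A bounded continuous test function on the real line.
    return 1.0 / (1.0 + x * x)

def indicator_of_origin(x):
    # Indicator of the closed set A = {0}; its boundary is {0}, which has P-mass 1.
    return 1.0 if x == 0.0 else 0.0

lengths = (1, 10, 100, 1000)

# Condition 2: the integral of f under P_n is f(1/n), which approaches f(0).
gaps = [abs(f(1.0 / n) - f(0.0)) for n in lengths]

# Condition 3 does not apply to A = {0}: P_n(A) = 0 for every n although P(A) = 1.
mass_of_origin = [indicator_of_origin(1.0 / n) for n in lengths]
```

The integrals of the continuous function converge, while the mass assigned to the set $\{0\}$ does not, which is consistent with the theorem because $P(\partial\{0\}) = 1 \neq 0$.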

B. Channels and Channel Inputs 

A memoryless channel from an input $X$ to an output $Y$ is described by a transition kernel $W_{Y|X}(B|x)$
for $x \in E_X$, $B \in \mathcal{E}_Y$ which must satisfy the usual measurability conditions of a kernel. Furthermore, as
in [5], we make the following additional continuity assumption on $W_{Y|X}(\cdot|x)$:

Definition 2: A transition kernel $W_{Y|X}(\cdot|x)$ is said to be a channel if $W_{Y|X}(\cdot|x)$ depends continuously
(in the weak* sense) on $x$, i.e.,

$$W_{Y|X}(\cdot|x) = \text{w-}\lim_{n \to \infty} W_{Y|X}(\cdot|x_n), \qquad (4)$$

whenever $x = \lim_n x_n$.
Given a measure $P_X$ and a kernel $W_{Y|X}$ we denote by $P_X \otimes W_{Y|X}$ and $P_X W_{Y|X}$ the measures on
$E_X \times E_Y$ and $E_Y$ uniquely defined by

$$(P_X \otimes W_{Y|X})(A \times B) := \int_A W_{Y|X}(B|x) P_X(dx) \qquad (5)$$

$$(P_X W_{Y|X})(B) := \int_{E_X} W_{Y|X}(B|x) P_X(dx), \qquad (6)$$

for Borel sets $A \in \mathcal{E}_X$ and $B \in \mathcal{E}_Y$. When clear from context, we will denote the marginal $P_X W_{Y|X}$ of
$P_X \otimes W_{Y|X}$ by $P_Y$.
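
For finite alphabets the integrals in (5) and (6) reduce to sums, which the following sketch illustrates. The dictionary layout `w[x][y]` for the kernel and the binary symmetric channel used as an example are assumptions made for illustration only.

```python
def joint_measure(p_x, w):
    """Finite-alphabet version of (5): (P_X (x) W)({x} x {y}) = P_X(x) * W(y|x)."""
    return {(x, y): p_x[x] * w[x][y] for x in p_x for y in w[x]}

def output_marginal(p_x, w):
    """Finite-alphabet version of (6): (P_X W)({y}) = sum over x of P_X(x) * W(y|x)."""
    p_y = {}
    for x, mass in p_x.items():
        for y, transition in w[x].items():
            p_y[y] = p_y.get(y, 0.0) + mass * transition
    return p_y

# Hypothetical example: uniform input to a binary symmetric channel with crossover 0.1.
p_x = {0: 0.5, 1: 0.5}
bsc = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.1, 1: 0.9}}

p_xy = joint_measure(p_x, bsc)
p_y = output_marginal(p_x, bsc)
```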

In practice, sequences are input to communication channels. In this paper, all sequences belong to
product spaces, i.e., an input sequence $\mathbf{x} = (x_1, \ldots, x_n)$ belongs to $\times_{i=1}^{n} E_X := E_X^n$ while an
output sequence, say $Y^n$, belongs to $\times_{i=1}^{n} E_Y := E_Y^n$.

In general, an input sequence $\mathbf{x}$ results in a random output sequence $\mathbf{Y}$ described by a transition
kernel $W_{\mathbf{Y}|\mathbf{X}}(B^n|\mathbf{x})$ where $B^n \in \mathcal{E}_Y^n$. In this paper, all channels are stationary and memoryless. Thus
if a sequence $\mathbf{x} = (x_1, \ldots, x_n)$ is input to a kernel $W_{\mathbf{Y}|\mathbf{X}}$, then the probability that the output $\mathbf{Y}$
lies in a product set $\times_{\ell=1}^{n} B_\ell$, $B_\ell \in \mathcal{E}_Y$, is $\prod_{\ell=1}^{n} W_{Y|X}(B_\ell|x_\ell)$, where $W_{Y|X}(\cdot|x)$ is
assumed to satisfy the constraint of Definition 2.

It is common for channel inputs to be required to satisfy some constraints. For example, the Gaussian
channel typically has a power constraint

$$\frac{1}{n} \sum_{i=1}^{n} x_i^2 \le P. \qquad (7)$$



We now establish the equivalent concept in this paper. 

Definition 3: We say that a function $g(x)$ and a threshold $\Gamma$ form an input constraint provided $g(\cdot)$
is continuous and bounded. We say that the input vector $\mathbf{x} = (x_1, \ldots, x_n)$ satisfies the input constraint
provided

$$\frac{1}{n} \sum_{i=1}^{n} g(x_i) \le \Gamma, \qquad (8)$$

and with abuse of notation, we define $g(\mathbf{x}) := \frac{1}{n} \sum_{i=1}^{n} g(x_i)$.

Remark 4: A bounded constraint or cost $g(x)$ is sometimes assumed in the literature [16]. If the input
alphabet $E_X$ is compact, then any continuous $g(x)$ is always bounded. While the bounded assumption on
$g(x)$ initially precludes a power constraint of the form (7) when the alphabet is $\mathbb{R}$, we will see that the
classic result in the additive Gaussian noise case with a power constraint can be recovered by considering
a sequence (in $L$) of compact input alphabets $E_{X_L}$. Finally, if the sequence of empirical input distributions
on $X$ converges weakly* to $P_X$, then the continuity assumption can be relaxed to $P_X(D_g) = 0$ where
$D_g$ is the set of discontinuities of $g(\cdot)$ (see part (iii) of [15, Theorem 13.16]).

C. Two Examples 

We now give two common examples of alphabets that satisfy the above constraints. 

First, consider two random variables $X$ and $Y$ whose alphabets $E_X$ and $E_Y$ are finite. Then these
trivially satisfy the assumptions of Section II-A if we choose as metric the trivial metric $d_X(x_1, x_2) = 1$
if $x_1 \neq x_2$ and $0$ otherwise, and likewise for $d_Y(\cdot, \cdot)$. Furthermore, in this case, any channel from $X$ to
$Y$ satisfies the weak* continuity assumption of Section II-B since if a sequence $x_n \to x$, then in fact
$x_n = x$ for all $n$ greater than some $N$, i.e., $W(\cdot|x_n) = W(\cdot|x)$ for all $n > N$.

As a second example, we consider two random variables $X$ and $Y$ whose alphabets are $E_X = E_Y = \mathbb{R}$.
With the usual metric on the real line, these are Polish spaces. Furthermore, suppose that $Y = X + Z$,
where $Z$ is independent of $X$ and has a density $f(z)$, i.e.,

$$W_{Y|X}(B|x) = \int_B f(y - x) \, dy. \qquad (9)$$

Then as shown in [17, Lemma 2], the channel $W_{Y|X}$ satisfies the weak* continuity assumption of Section
II-B.

III. Information Measures 

A. Definitions 

In this section, we provide some basic background on information measures and introduce a key result. 
Let $P$ and $M$ be two probability measures defined on $E$ and let $\mathcal{Q} = \{Q_1, Q_2, \ldots, Q_{|\mathcal{Q}|}\}$ be a finite
(measurable) partition of $E$. Then, we define (see [10, Section 2.3])

$$H_{P\|M}(\mathcal{Q}) := \sum_{i=1}^{|\mathcal{Q}|} P(Q_i) \log \frac{P(Q_i)}{M(Q_i)}. \qquad (10)$$

The divergence of $P$ with respect to $M$ is defined as (see [10, Section 5.2])

$$D(P\|M) = \sup_{\mathcal{Q}} H_{P\|M}(\mathcal{Q}), \qquad (11)$$

where the supremum is over all finite measurable partitions $\mathcal{Q}$ of $E$.
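
The supremum in (11) can be explored numerically by refining a partition. The sketch below is an illustration under stated assumptions: $P$ and $M$ are taken to be unit-variance Gaussians with means $0$ and $1$ (for which the divergence is $1/2$ nat), natural logarithms are used, and the partitions consist of equal cells on $(-6, 6]$ plus two tail cells. None of these choices come from the paper.

```python
import math

SQRT2 = math.sqrt(2.0)

def normal_cell_prob(a, b, mean):
    """Probability that a unit-variance normal with the given mean falls in (a, b]."""
    a, b = a - mean, b - mean
    if a >= 0:  # right of the mean: subtract upper tails to keep relative accuracy
        return 0.5 * (math.erfc(a / SQRT2) - math.erfc(b / SQRT2))
    return 0.5 * (math.erfc(-b / SQRT2) - math.erfc(-a / SQRT2))

def partition_divergence(num_cells, lo=-6.0, hi=6.0, mean_p=0.0, mean_m=1.0):
    """H_{P||M}(Q) of (10), in nats, for num_cells equal cells on (lo, hi] plus two tails."""
    width = (hi - lo) / num_cells
    edges = [-math.inf] + [lo + i * width for i in range(num_cells + 1)] + [math.inf]
    total = 0.0
    for a, b in zip(edges[:-1], edges[1:]):
        p = normal_cell_prob(a, b, mean_p)
        m = normal_cell_prob(a, b, mean_m)
        if p > 0.0:
            total += p * math.log(p / m)
    return total

# Each partition refines the previous one, so the values should not decrease.
values = [partition_divergence(2 ** j) for j in range(1, 9)]
```

Because the partitions are nested, the computed values increase with the refinement and stay below the divergence, in line with the definition as a supremum.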

Recall that a field $\mathcal{F}$ has the properties that i) $E \in \mathcal{F}$, ii) if $F \in \mathcal{F}$ then $F^c \in \mathcal{F}$ and iii) $\mathcal{F}$ is closed
under finite unions. In the next subsection, we will construct a field that, in addition to generating the
$\sigma$-field, i.e., $\mathcal{E} = \sigma(\mathcal{F})$,^3 has a desirable property. Fields that generate the $\sigma$-field are of particular interest
due to the following two lemmas.

Lemma 5: [10, Lemma 5.2.2] Let $(E, \mathcal{E})$ be a measurable space, $\mathcal{F}$ a field that generates $\mathcal{E}$ and $P$
and $M$ two measures defined on this space. Then

$$D(P\|M) = \sup_{\mathcal{Q} \subset \mathcal{F}} H_{P\|M}(\mathcal{Q}). \qquad (12)$$

The above lemma states that it is sufficient to restrict the finite partitions to subsets of a generating field.

Suppose $(E_X, \mathcal{E}_X)$ and $(E_Y, \mathcal{E}_Y)$ are measurable spaces, generated by the fields $\mathcal{F}_X$ and $\mathcal{F}_Y$ respectively.
Then the product $\sigma$-field $\mathcal{E}_{XY}$ is generated by the field of rectangles $\mathcal{F}_{XY} = \mathcal{F}_X \times \mathcal{F}_Y$. Thus, we have
the following result (see [10, Lemma 5.5.1]).

Lemma 6: Let $P_{XY}$ be a measure on $E_X \times E_Y$ and $P_X$ and $P_Y$ the respective marginals on $E_X$ and $E_Y$.
Then

$$I(X;Y) = D(P_{XY}\|P_X \times P_Y) = \sup_{\mathcal{Q}_X \subset \mathcal{F}_X, \, \mathcal{Q}_Y \subset \mathcal{F}_Y} H_{P_{XY}\|P_X \times P_Y}(\mathcal{Q}_X \times \mathcal{Q}_Y). \qquad (13)$$

B. A Special Field 

In this subsection, given a measure $P$, we will construct a generating field $\mathcal{F}_P$ with the key property
that the $P$-measure of the boundary of any set in the field is zero. This last property is desirable as it
will ensure that if a sequence of measures $P_n$ converges weakly* to $P$, then $P_n(A) \to P(A)$ for all sets
$A \in \mathcal{F}_P$.

While there is no lack of standard constructions for fields that generate the $\sigma$-field, given any such
field $\mathcal{F}$, there is no guarantee that $P(\partial A) = 0$ for all $A \in \mathcal{F}$ and thus, it is necessary to construct such
a generating field $\mathcal{F}_P$ specifically for each limiting measure $P$.

Lemma 7: Let $P$ be a probability measure defined on a Polish measurable space $(E, \mathcal{E})$. Then there
exists a countable family $\mathcal{A} \subset \mathcal{E}$ of open sets that i) generates the Borel $\sigma$-field $\mathcal{E}$, i.e., $\sigma(\mathcal{A}) = \mathcal{E}$, and
ii) $P(\partial A) = 0$ for all $A \in \mathcal{A}$.

Proof: Since $E$ is Polish, there is a countable dense subset of $E$, say $E'$. For $\mathcal{A}$ to generate $\mathcal{E}$, it is
sufficient that for each $x \in E'$ there is a countable dense subset $R_x$ of $\mathbb{R}^+$ with each ball $B(x, r) \in \mathcal{A}$
for all $r \in R_x$. This is because then each open set of $E$ is the countable union of balls in $\mathcal{A}$.

It thus remains to be shown that the sets $R_x$ can be chosen such that each ball $B(x, r)$ has
$P(\partial B(x, r)) = 0$. For $r > 0$, let $F_x(r) = P(B(x, r))$. Then $F_x(r)$ is non-decreasing, bounded below by $0$ and
above by $1$. Thus it has left and right limits and at most a countable number of jump discontinuities and thus,
there is a countable dense subset $R'_x$ of $\mathbb{R}^+$ on which $F_x(r)$ is continuous. We claim that choosing
$R_x = R'_x$ will do. Since $\partial B(x, r) \subset \{y : d(x, y) = r\}$, then $P(\partial B(x, r)) \le F_x(r+) - F_x(r)$ where
$F_x(r+)$ is the right limit of $F_x(r)$. However, for $r \in R'_x$, $F_x(r+) = F_x(r)$ by continuity. ■

Corollary 8: Let $P$ be a probability measure defined on a Polish measurable space $(E, \mathcal{E})$. Then there
exists a countable generating field $\mathcal{F}_P$ with the property that $P(\partial A) = 0$ for all $A \in \mathcal{F}_P$.

Proof: Since $\partial(A \cap B) \subset \partial A \cup \partial B$, $\partial(A \cup B) \subset \partial A \cup \partial B$ and $\partial A = \partial(A^c)$, one can extend the
countable family $\mathcal{A}$ in Lemma 7 to include all finite intersections/unions of balls in $\mathcal{A}$ and complements
of balls in $\mathcal{A}$, i.e., extend $\mathcal{A}$ to a field $\mathcal{F}_P$. ■

^3 $\sigma(\mathcal{F})$ is the smallest $\sigma$-field that contains the family $\mathcal{F}$ of sets.

IV. Weak* Typical Sequences 
A. Definitions 

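Given a sequence $\mathbf{x} = (x_1, \ldots, x_n) \in E_X^n$, one can associate an empirical distribution $P_{\mathbf{x}}$ defined by

$$P_{\mathbf{x}}(A) := \frac{1}{n} \sum_{\ell=1}^{n} 1_{\{x_\ell \in A\}}. \qquad (14)$$

When clear from context, we denote the empirical distribution by $P_X^n$. Likewise, given two sequences
$\mathbf{x} = (x_1, \ldots, x_n) \in E_X^n$ and $\mathbf{y} = (y_1, \ldots, y_n) \in E_Y^n$, the joint empirical distribution $P_{\mathbf{x},\mathbf{y}}$ is defined by

$$P_{\mathbf{x},\mathbf{y}}(A \times B) := \frac{1}{n} \sum_{\ell=1}^{n} 1_{\{x_\ell \in A\}} 1_{\{y_\ell \in B\}}, \qquad (15)$$

which is denoted by $P_{XY}^n$ when clear from context.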

A sequence x should be typical if its empirical distribution is in some sense close to some probability 
measure. In this paper, closeness is measured with respect to the weak* topology. 

Specifically, let $d(\cdot, \cdot)$ be any metric on the space of probability measures $\mathcal{M}_1(E)$ that induces the
weak* topology and fix this metric for the rest of the paper. The Prohorov metric is an example of such
a metric and the exact choice of the metric is irrelevant.

We denote by $B(M, \epsilon)$ the ball of distributions $\{P \in \mathcal{M}_1(E) : d(P, M) < \epsilon\}$. We will say that
an empirical distribution $P^n$ is $P$-typical when its distance from $P$ is sufficiently small. We make the
following definitions.

Definition 9: Let $d(\cdot, \cdot)$ be a metric on the space of probability measures $\mathcal{M}_1(E_X)$ that induces the
weak* topology. A sequence $\mathbf{x} = (x_1, \ldots, x_n)$ is said to be weakly* $(P_X, \epsilon)$-typical if $d(P_{\mathbf{x}}, P_X) < \epsilon$.
Similarly, we say that an empirical distribution $P_X^n$ is weakly* $(P_X, \epsilon)$-typical if $d(P_X^n, P_X) < \epsilon$. We
denote the set of such length $n$ typical sequences by $A_\epsilon^n(P_X)$.

Remark 10: For finite alphabets, it is interesting to compare the definition of weak* typicality to that
of strong typicality. Specifically, if $|E_X|$ is finite and $P_X^n \in A_\epsilon^n(P_X)$, then $|P_X^n(a) - P_X(a)| < \delta$ for some
$\delta > 0$ and all $a \in E_X$, and $\delta \to 0$ as $\epsilon \to 0$. This coincides with the definition of strong typicality
(except for the occasional requirement that $P_X^n(a) = 0$ if $P_X(a) = 0$). Thus weak* typical sequences can
be viewed as a generalization of strong typical sequences.

Unless stated otherwise, in the sequel all typical sequences are weak* typical sequences. 
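
For a finite alphabet, the definition can be sketched directly. The choice of metric below, $d(P, M) = \max_a |P(a) - M(a)|$, is an assumption made for illustration: on a finite alphabet it induces the usual topology on the probability simplex, which coincides with the weak* topology, and with this choice the test reduces to the strong typicality condition (2). The function names are hypothetical.

```python
from collections import Counter

def empirical_distribution(x, alphabet):
    """Empirical distribution (14) of the sequence x on a finite alphabet."""
    n = len(x)
    counts = Counter(x)
    return {a: counts[a] / n for a in alphabet}

def max_metric(p, m):
    """Illustrative metric on the simplex: largest pointwise difference."""
    return max(abs(p[a] - m[a]) for a in p)

def is_weak_star_typical(x, p_x, eps):
    """Definition 9 with the illustrative metric; assumes every letter of x is in p_x."""
    return max_metric(empirical_distribution(x, p_x.keys()), p_x) < eps

# Hypothetical example with a biased binary source.
p_x = {0: 0.7, 1: 0.3}
matching = [0] * 70 + [1] * 30
constant = [0] * 100
```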

We also find it convenient to introduce a notion of asymptotically typical sequences. Given a set
of sequences $x^{n_1}, x^{n_2}, \ldots$ of lengths $n_1 < n_2 < \cdots$, there is a corresponding sequence of empirical
distributions $P_X^{n_1}, P_X^{n_2}, \ldots$.

Definition 11: We say that the sequence of sequences $\{x^{n_k}\}$ is asymptotically $P_X$-typical provided
that the corresponding sequence of empirical measures satisfies

$$P_X = \text{w-}\lim_{k \to \infty} P_X^{n_k}. \qquad (16)$$

Remark 12: If a sequence of sequences $\{x^{n_k}\}$ is asymptotically $P_X$-typical, then for any $\epsilon > 0$ there
exists a $K$ such that for all $k > K$, $x^{n_k}$ is weak* $(P_X, \epsilon)$-typical.

In some cases it will be more convenient to first prove certain results for asymptotically typical sequences
of sequences, and then as a corollary infer behavior of typical sequences for large length $n$.

Jointly weak* typical sequences are defined analogously. Specifically,

Definition 13: Two sequences $\mathbf{x} = (x_1, \ldots, x_n)$ and $\mathbf{y} = (y_1, \ldots, y_n)$ are said to be weak* $(P_{XY}, \epsilon)$-typical
if $d(P_{\mathbf{x},\mathbf{y}}, P_{XY}) < \epsilon$. Similarly, we say that the empirical distribution $P_{XY}^n$ is weak* $(P_{XY}, \epsilon)$-typical
if $d(P_{XY}^n, P_{XY}) < \epsilon$. We denote the set of such pairs of length $n$ weak* typical sequences by
$A_\epsilon^n(P_{XY})$.

Likewise, one can also define a sequence of pairs of sequences $\{(x^{n_k}, y^{n_k})\}$ to be asymptotically
$P_{XY}$-typical in the obvious way.

B. Consistency Properties 

There are several desirable properties that typical and jointly typical sequences should possess. 

First, a random i.i.d. sequence should be typical with high probability. Second, a $(P_X, \epsilon)$-typical
sequence $\mathbf{x}$ should have a cost $\frac{1}{n} \sum_{i} g(x_i)$ close to $E_{P_X}[g(X)]$. Third, if two sequences are jointly
typical, then one would expect each sequence to be typical in its own right.

The following lemma shows that the first is indeed true for asymptotically typical sequences. 
Lemma 14: Let $X_1, X_2, \ldots$ be a sequence of independent random variables with values in $E_X$ with
identical distribution $P_X$ and $\{X^{n_k}\}$ a corresponding sequence of sequences. Then almost surely $\{X^{n_k}\}$
is asymptotically $P_X$-typical.

Proof: This is a direct restatement of Varadarajan's Theorem [7, Theorem 11.4.1]. ■ 
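
The almost sure convergence of empirical measures can be observed in a small simulation. The sketch below is illustrative only: it assumes $P_X$ is uniform on $[0, 1]$, uses the single bounded continuous test function $\cos$ (whose integral under $P_X$ is $\sin 1$), and fixes a seed for reproducibility. Convergence for one test function is of course only a necessary symptom of weak* convergence.

```python
import math
import random

def empirical_integral(f, samples):
    """Integral of f with respect to the empirical measure of the samples."""
    return sum(f(s) for s in samples) / len(samples)

rng = random.Random(0)  # fixed seed so that the sketch is reproducible
samples = [rng.random() for _ in range(100_000)]

exact = math.sin(1.0)  # integral of cos over [0, 1]
errors = {
    n: abs(empirical_integral(math.cos, samples[:n]) - exact)
    for n in (100, 10_000, 100_000)
}
```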

We now show that all three statements are true for weak* typical sequences, where the first is a 
consequence of the result for asymptotically typical sequences. 

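Theorem 15: The following hold.

1) Let $X_1, X_2, \ldots, X_n$ be independent random variables with values in $E_X$ with identical distribution
$P_X$. Then for any $\epsilon > 0$,

$$\lim_{n \to \infty} P(X^n \in A_\epsilon^n(P_X)) = 1. \qquad (17)$$

2) For every $\delta > 0$, there is an $\epsilon(\delta) > 0$ such that if $\mathbf{x} \in A_{\epsilon(\delta)}^n(P_X)$ then

$$\left| E_{P_{\mathbf{x}}}[g(X)] - E_{P_X}[g(X)] \right| < \delta. \qquad (18)$$

3) For any $\epsilon > 0$, there is an $\bar{\epsilon}(\epsilon) > 0$ such that if $(\mathbf{x}, \mathbf{y}) \in A_{\bar{\epsilon}(\epsilon)}^n(P_{XY})$ then $\mathbf{x} \in A_\epsilon^n(P_X)$.

Proof: i) Otherwise one could find a sequence of sequences $\{X^{n_k}\}$ in Lemma 14 that would not
be asymptotically typical almost surely.

ii) Let $M_X$ be any (not necessarily empirical) measure on $E_X$. By part 2 of Theorem 1 there is an
$\epsilon(\delta)$ such that $M_X \in B(P_X, \epsilon(\delta))$ implies

$$\left| E_{M_X}[g(X)] - E_{P_X}[g(X)] \right| < \delta. \qquad (19)$$

The result follows since $\mathbf{x} \in A_{\epsilon(\delta)}^n(P_X)$ iff $P_{\mathbf{x}} \in B(P_X, \epsilon(\delta))$.

iii) By part 3 of Theorem 1, if $M_{XY}^k$ is a sequence of (not necessarily empirical) measures on $E_X \times E_Y$
with marginals $M_X^k$ and such that $\lim_k d(M_{XY}^k, P_{XY}) = 0$, then $\lim_k d(M_X^k, P_X) = 0$. Thus for each $\epsilon$,
there is an $\bar{\epsilon}(\epsilon)$ such that $d(M_{XY}, P_{XY}) < \bar{\epsilon}(\epsilon)$ implies $d(M_X, P_X) < \epsilon$. The result again follows since
$(\mathbf{x}, \mathbf{y}) \in A_{\bar{\epsilon}(\epsilon)}^n(P_{XY})$ iff $P_{\mathbf{x},\mathbf{y}} \in B(P_{XY}, \bar{\epsilon}(\epsilon))$. ■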

An important desirable property of typical sequences is that if a typical sequence is input to a channel
then the input and output should be jointly typical in some sense. We have the following lemma.

Lemma 16: Let the sequence of sequences $\{x^{n_k}\}$ be asymptotically $P_X$-typical and consider the
output sequences $Y^{n_k}$ generated by the stationary memoryless channel $W_{Y|X}(\cdot|x)$. Then the sequence
of sequences $\{(x^{n_k}, Y^{n_k})\}$ is asymptotically $P_X \otimes W_{Y|X}$-typical almost surely.

Remark 17: Consider the Markov chain $X - Y - Z$. Then if a sequence of sequences $\{(x^n, y^n)\}$
is asymptotically $P_{XY}$-typical and is used to generate output sequences $Z^n$ according to the channel
$W_{Z|XY}(\cdot|x, y) = W_{Z|Y}(\cdot|y)$, then the sequences $\{(x^n, y^n, Z^n)\}$ are almost surely asymptotically
$P_{XY} \otimes W_{Z|Y}$-typical, i.e., the Markov Lemma holds for asymptotically typical sequences.

Proof: For ease of notation, we consider the case $n_k = k$. Let $P_X^k$ denote the empirical distribution
of the $k$-length sequence $x^k$. Let $P_Y$ denote the marginal $P_X W_{Y|X}$ and let $P_{XY}^k$ denote the empirical
distribution of the pair of sequences $x^k$ and $Y^k$.

The outline of the proof is as follows. First, for each of $P_X$ and $P_Y$, consider two fields $\mathcal{F}_X$ and
$\mathcal{F}_Y$ as described in Corollary 8. We will first show that almost surely for all $A \in \mathcal{F}_X$, $B \in \mathcal{F}_Y$,
$\lim_k P_{XY}^k(A \times B) = P_X \otimes W_{Y|X}(A \times B)$. Thus, by [1, Chapter 1, Theorem 2.2], $P_X \otimes W_{Y|X} =
\text{w-}\lim_k P_{XY}^k$ since each open set of $E_X \times E_Y$ is a countable union of rectangular sets $A \times B$, $A \in \mathcal{F}_X$,
$B \in \mathcal{F}_Y$.

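Now, consider a set $A \in \mathcal{F}_X$ and $B \in \mathcal{F}_Y$ and observe that

$$\left| P_{XY}(A \times B) - P_{XY}^k(A \times B) \right| \le \left| P_{XY}(A \times B) - P_X^k \otimes W_{Y|X}(A \times B) \right| + \left| P_X^k \otimes W_{Y|X}(A \times B) - P_{XY}^k(A \times B) \right|, \qquad (20)$$

where $P_{XY} = P_X \otimes W_{Y|X}$. From Lemma 2 of [5], since $P_X = \text{w-}\lim_k P_X^k$, then $P_X \otimes W_{Y|X} =
\text{w-}\lim_k P_X^k \otimes W_{Y|X}$. Since $P_X \otimes W_{Y|X}(\partial(A \times B)) \le P_X(\partial A) + P_Y(\partial B) = 0$, it follows that
$\lim_k P_X^k \otimes W_{Y|X}(A \times B) = P_X \otimes W_{Y|X}(A \times B)$ and thus the first term on the right side of (20) is $0$ in the limit.
As for the second term on the right side of (20), we note that

$$P_X^k \otimes W_{Y|X}(A \times B) - P_{XY}^k(A \times B) = \frac{1}{k} \sum_{i=1}^{k} 1_{\{x_i \in A\}} \left[ W_{Y|X}(B|x_i) - 1_{\{Y_i \in B\}} \right]. \qquad (21)$$

Let $Z_i = 1_{\{x_i \in A\}} [W_{Y|X}(B|x_i) - 1_{\{Y_i \in B\}}]$. Then, the second term on the right of (20) is

$$\left| \frac{1}{k} \sum_{i=1}^{k} Z_i \right|. \qquad (22)$$

Note that given the non-random sequence $x^k$, i) the random variables $Z_i$ are independent, ii) $E[Z_i] = 0$,
and iii) $\sup_i \text{var}[Z_i] \le 1$ since $-1 \le Z_i \le 1$. Then by [15, Theorem 5.29], the sum in (22) converges to
$0$ almost surely, i.e.,

$$\lim_{k \to \infty} \left| P_X^k \otimes W_{Y|X}(A \times B) - P_{XY}^k(A \times B) \right| = 0 \qquad (23)$$

almost surely. Since the families of sets $\mathcal{F}_X$ and $\mathcal{F}_Y$ are countable, it follows that almost surely the right
side of (20) vanishes as $k \to \infty$ for all $A \in \mathcal{F}_X$ and $B \in \mathcal{F}_Y$. ■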
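
The statement of Lemma 16 can be checked empirically in the finite alphabet case. The following sketch is illustrative: it feeds a deterministic, perfectly balanced binary sequence (which is typical for the uniform input law) through a binary symmetric channel with crossover probability $0.1$ and compares the joint empirical distribution with $P_X \otimes W_{Y|X}$. The channel, the sequence length and the seed are assumptions, not values from the paper.

```python
import random
from collections import Counter

def simulate_channel(x, w, rng):
    """Pass each input letter through the memoryless channel w[x][y] = W(y|x)."""
    y = []
    for letter in x:
        outputs = list(w[letter])
        weights = [w[letter][o] for o in outputs]
        y.append(rng.choices(outputs, weights=weights)[0])
    return y

def joint_empirical(x, y):
    """Joint empirical distribution (15) on pairs of letters."""
    n = len(x)
    return {pair: count / n for pair, count in Counter(zip(x, y)).items()}

rng = random.Random(1)  # fixed seed so that the sketch is reproducible
n = 20_000
p_x = {0: 0.5, 1: 0.5}
bsc = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.1, 1: 0.9}}

x = [0, 1] * (n // 2)            # non-random input, typical for the uniform law
y = simulate_channel(x, bsc, rng)
empirical = joint_empirical(x, y)
target = {(a, b): p_x[a] * bsc[a][b] for a in p_x for b in bsc[a]}
deviation = max(abs(empirical.get(k, 0.0) - v) for k, v in target.items())
```

Note that the input sequence is not drawn at random; only the channel is random, which mirrors the role of the variables $Z_i$ in the proof.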
Theorem 18: Let $x^n$ be an input sequence to a stationary memoryless channel $W_{Y|X}(\cdot|x)$ and let
$Y^n$ be the corresponding output sequence. For every $\epsilon > 0$ and $\delta > 0$, there exists an $\bar{\epsilon}(\epsilon, \delta)$ such that
if $x^n \in A_{\bar{\epsilon}(\epsilon,\delta)}^n(P_X)$ for all $n$ greater than some $N$, then

$$\liminf_{n \to \infty} P\left( (x^n, Y^n) \in A_\epsilon^n(P_X \otimes W_{Y|X}) \right) > 1 - \delta. \qquad (24)$$

Proof: Suppose that for a given $\epsilon$ and $\delta$, no such $\bar{\epsilon}(\epsilon, \delta)$ can be found. Then, we can find a
sequence of sequences $\{x^{n_k}\}$ with corresponding empirical measures $P_X^{n_k}$ for increasing $n_k$ such that
$d(P_X^{n_k}, P_X) < 1/k$, i.e., $P_X = \text{w-}\lim_k P_X^{n_k}$ and

$$\liminf_{k \to \infty} P\left[ (x^{n_k}, Y^{n_k}) \in A_\epsilon^{n_k}(P_X \otimes W_{Y|X}) \right] \le 1 - \delta. \qquad (25)$$

But this contradicts the almost sure asymptotic $(P_X \otimes W_{Y|X})$-typicality of $(x^{n_k}, Y^{n_k})$ in Lemma 16. ■

C. Large Deviations 

The next theorems provide some large deviations results for weak* typical sequences. The first theorem
looks at the probability of a random i.i.d. sequence drawn according to a law $P$ to be weak* $M$-typical.
This is shown to be $\approx 2^{-nD(M\|P)}$ which is the same result as for weak and strong typical sequences.
For example, let $P_{XY}$ be the joint law for $E_X \times E_Y$ with marginals $P_X$ and $P_Y$. Then the probability
that a pair of sequences $\mathbf{X}$ and $\mathbf{Y}$ drawn according to $P_X \otimes P_Y$ is $P_{XY}$-typical is $\approx 2^{-nI(X;Y)}$.

The next two theorems then consider the more special case of when the sequence $\mathbf{y}$ is non-random
and known to be weak* typical but $\mathbf{X}$ is random. There, we again show that the probability that the
pair of sequences is weak* typical is $\approx 2^{-nI(X;Y)}$. This result is normally proved for strong typical
sequences using the notion of conditional strongly typical sequences and no analog exists in general for
weak typical sequences.

Theorem 19: Let $P$ and $M$ be measures on a common probability space $(E, \mathcal{E})$ and let a random
sequence $X^n$ be chosen i.i.d. according to the law $\mathcal{L}(X_i) = P$, i.e., draw the sequence $X^n$ according to the
measure $\mu^n = \otimes_{i=1}^{n} P$. Define the sequence of probabilities $a_n = \mu^n(X^n \in A_\epsilon^n(M))$, i.e., the probability
that the drawn sequence is weak* $(M, \epsilon)$-typical. If $D(M\|P)$ is finite, then for every $\delta > 0$ there is an
$\epsilon(\delta) > 0$ such that for all $\epsilon < \epsilon(\delta)$

$$-D(M\|P) \le \liminf_{n} \frac{1}{n} \log a_n \le \limsup_{n} \frac{1}{n} \log a_n \le -D(M\|P) + \delta. \qquad (26)$$

If $D(M\|P) = \infty$, then for each $L > 0$ there is an $\epsilon(L)$ such that for $\epsilon < \epsilon(L)$, the right side of (26) is
$-L$.

Remark 20: Under these assumptions, it follows that for any $\delta > 0$ and $\delta' > 0$, there is a sufficiently
large $N$ such that for all $n > N$,

$$\mu^n(X^n \in A_\epsilon^n(M)) < 2^{n(-D(M\|P) + \delta + \delta')}. \qquad (27)$$

Since both $\delta$ and $\delta'$ are arbitrary, they can be absorbed into a single "$\delta$" term.
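
The exponent in (26) can be computed exactly for a binary alphabet. In the sketch below, which is an illustration under stated assumptions, the source law $P$ is Bernoulli($1/2$), the target law $M$ is Bernoulli($0.8$), the metric is the absolute difference of the probability of the letter $1$, and logarithms are in base 2. The helper names are hypothetical.

```python
import math

def kl_bits(q, p):
    """Binary divergence D(Bern(q) || Bern(p)) in bits."""
    total = 0.0
    for a, b in ((q, p), (1.0 - q, 1.0 - p)):
        if a > 0:
            total += a * math.log2(a / b)
    return total

def log2_binom_pmf(n, k, p):
    """log2 of the probability of k ones in n i.i.d. Bern(p) draws."""
    log_comb = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return (log_comb + k * math.log(p) + (n - k) * math.log(1.0 - p)) / math.log(2.0)

def typical_set_exponent(n, m, p, eps):
    """-(1/n) log2 a_n, where a_n = Pr(|k/n - m| < eps) under Bern(p)."""
    logs = [log2_binom_pmf(n, k, p) for k in range(n + 1) if abs(k / n - m) < eps]
    top = max(logs)
    log2_a_n = top + math.log2(sum(2.0 ** (v - top) for v in logs))  # log-sum-exp
    return -log2_a_n / n

n, m, p = 2000, 0.8, 0.5
divergence = kl_bits(m, p)
exponents = [typical_set_exponent(n, m, p, eps) for eps in (0.05, 0.02, 0.005)]
```

As $\epsilon$ shrinks, the computed exponent increases toward $D(M\|P)$ while remaining below it, which matches the sandwich in (26) up to finite-length effects.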

Proof: Let $P_X^n$ be the empirical measure of the drawn sequence. We first show the lower bound.
By the definition of weak* typicality, we recognize that

$$a_n = \mu^n(X^n \in A_\epsilon^n(M)) = \mu^n(P_X^n \in B(M, \epsilon)), \qquad (28)$$

where $B(M, \epsilon)$ is an open ball in the space $\mathcal{M}_1(E)$ in the weak* topology. Then, by Sanov's Theorem
(Corollary 6.2.3 of [6]) we have the large deviations principle

$$-\inf_{\nu \in B(M, \epsilon)} \Lambda^*(\nu) \le \liminf_{n} \frac{1}{n} \log \mu^n(P_X^n \in B(M, \epsilon)). \qquad (29)$$

The lower bound then follows since in the weak* topology $\Lambda^*(\nu) = D(\nu\|P)$ (Lemma 6.2.13 of [6]) and
the inf is bounded by selecting any choice of $\nu \in B(M, \epsilon)$, say $\nu = M$.

To prove the upper bound, we use a quantization argument. Consider a field $\mathcal{F}_M$ as described in Corollary 8 for $M$.

If $D(M\|P)$ is finite, then $M \ll P$, and for any $\delta > 0$, choose a sufficiently fine finite partition $\mathcal{Q} \subset \mathcal{F}_M$ of $E$ such that the induced discrete probabilities $Q_P$ and $Q_M$ of $P$ and $M$ on the atoms of $\mathcal{Q}$ satisfy

$$D(Q_M\|Q_P) \ge D(M\|P) - \delta/2. \qquad (30)$$

Otherwise, $D(M\|P) = \infty$ and for each $L > 0$, we can find a partition $\mathcal{Q}$ such that $D(Q_M\|Q_P) > 2L$.
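The effect of refining the partition in (30) can be checked numerically. The following is a minimal sketch under assumptions that are not part of the proof: $M = \mathcal{N}(1,1)$ and $P = \mathcal{N}(0,1)$ on the real line (so $D(M\|P) = 1/2$ nat), uniform partitions of $[-6, 6]$ with two tail cells, and hypothetical helper names. It also assumes every cell has positive $P$-mass, which holds for these parameters.

```python
import math

def normal_cdf(x, mean):
    """CDF of a unit-variance Gaussian with the given mean (erfc keeps the lower tail accurate)."""
    return 0.5 * math.erfc(-(x - mean) / math.sqrt(2.0))

def cell_masses(mean, edges):
    """Probabilities of the cells (-inf, e0], (e0, e1], ..., (ek, inf)."""
    cdf = [0.0] + [normal_cdf(e, mean) for e in edges] + [1.0]
    return [hi - lo for lo, hi in zip(cdf, cdf[1:])]

def quantized_divergence(mean_m, mean_p, step, limit=6.0):
    """D(Q_M || Q_P) in nats on a uniform partition of [-limit, limit] plus two tails."""
    count = int(round(2 * limit / step))
    edges = [-limit + i * step for i in range(count + 1)]
    q_m = cell_masses(mean_m, edges)
    q_p = cell_masses(mean_p, edges)
    # Convention 0 log 0 = 0; assumes q_p > 0 wherever q_m > 0 (illustrative only).
    return sum(m * math.log(m / p) for m, p in zip(q_m, q_p) if m > 0.0)

coarse = quantized_divergence(1.0, 0.0, step=1.0)
fine = quantized_divergence(1.0, 0.0, step=0.25)   # a refinement of the coarse partition
print(coarse, fine)
```

Because the second partition refines the first, the quantized divergence cannot decrease, and it stays below the unquantized value of $1/2$ nat, which is the behavior the choice of $\mathcal{Q}$ in (30) relies on.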

In either case, denote these atoms by $A_1, \ldots, A_K$ for some integer $K$. Let $Q^n_X$ be the induced discrete probability of the empirical measure $P^n_X$ on the atoms of $\mathcal{Q}$. Then, the event $P^n_X \in B(M, \epsilon)$ implies that the discrete probabilities satisfy $|Q_M(A_k) - Q^n_X(A_k)| < \delta_1$ for all atoms, and $\delta_1 \to 0$ as $\epsilon \to 0$ by weak* convergence since $M(\partial A_k) = 0$.

Now, $Q^n_X$ is an empirical probability for a random sequence over a finite alphabet. Let $\Gamma$ be the set of all probability distributions $Q_\nu$ on the atoms of $\mathcal{Q}$ such that $|Q_M(A_k) - Q_\nu(A_k)| < \delta_1$ for all $k$. In the finite alphabet case we have the following well-known large deviations result ([6, Theorem 2.1.10])

$$\limsup_n \frac{1}{n} \log \mu^n(P^n_X \in B(M, \epsilon)) \le \limsup_n \frac{1}{n} \log \mu^n(Q^n_X \in \Gamma) \le -\inf_{Q_\nu \in \Gamma} D(Q_\nu\|Q_P). \qquad (31)$$

(Footnote to (30): Since $D(M\|P)$ is finite, this is straightforward by Lemma 5.)

If $D(M\|P)$ is finite, then $M \ll P$, and the divergence $D(Q_\nu\|Q_P)$ is continuous on the compact set of $Q_\nu$ such that $Q_\nu \ll Q_P$, which includes $Q_M$, and $D(Q_\nu\|Q_P)$ is infinite otherwise. Thus for small enough $\delta_1 > 0$, the inf can be bounded below by $D(Q_M\|Q_P) - \delta_2$ where $\delta_2 \to 0$ as $\delta_1 \to 0$. Hence, pick $\epsilon$ small enough that $\delta_2 < \delta/2$.

If $D(M\|P) = \infty$, there are two cases to consider.

First, if $Q_M \ll Q_P$, then the same argument as the finite $D(M\|P)$ case above shows that

$$-\inf_{Q_\nu \in \Gamma} D(Q_\nu\|Q_P) \le -2L + \delta_2, \qquad (32)$$

where $\delta_2 < L$ can be ensured by choosing $\epsilon$ small enough.

Second, if we do not have $Q_M \ll Q_P$, then there is a set $A \in \mathcal{Q}$ such that $Q_M(A) > 0$ and $Q_P(A) = 0$. If $\epsilon$ is chosen sufficiently small that $\delta_1 < M(A)/2$, then $Q_\nu(A) > 0$ for all $Q_\nu \in \Gamma$ and $D(Q_\nu\|Q_P) = \infty$ for all $Q_\nu \in \Gamma$.

Either way, $-\inf_{Q_\nu \in \Gamma} D(Q_\nu\|Q_P) \le -L$, where $L > 0$ is arbitrary. ■
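The finite-alphabet bound (31) can be illustrated on the simplest possible case. The sketch below is only an illustration under assumed parameters, not part of the proof: a binary alphabet, $Q_P$ a fair coin, $Q_M$ with mass $0.8$ on the first atom, and $\delta_1 = 0.05$. It evaluates the exact probability that the empirical frequency falls within $\delta_1$ of $0.8$ and compares the resulting exponent with the divergence at the boundary point $0.75$, which is where the infimum over $\Gamma$ is approached. The function names are hypothetical.

```python
import math

def kl_bits(q, p):
    """Binary divergence D(q||p) in bits, with the convention 0 log 0 = 0."""
    total = 0.0
    for a, b in ((q, p), (1.0 - q, 1.0 - p)):
        if a > 0.0:
            total += a * math.log2(a / b)
    return total

def log2_binom_pmf(n, k, p):
    """log2 of the Binomial(n, p) probability of k successes, for 0 < p < 1."""
    log_comb = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return (log_comb + k * math.log(p) + (n - k) * math.log(1.0 - p)) / math.log(2.0)

def empirical_rate(n, p, m, delta1):
    """Return -(1/n) log2 P[|k/n - m| < delta1] for k ~ Binomial(n, p)."""
    logs = [log2_binom_pmf(n, k, p) for k in range(n + 1) if abs(k / n - m) < delta1]
    top = max(logs)
    # Log-sum-exp in base 2 to avoid underflow of the individual probabilities.
    log2_prob = top + math.log2(sum(2.0 ** (v - top) for v in logs))
    return -log2_prob / n

rate = empirical_rate(2000, 0.5, 0.8, 0.05)
print(rate, kl_bits(0.75, 0.5))
```

For a block length of a few thousand the two printed numbers should agree to within a term of order $(\log n)/n$, consistent with the exponent in (31).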

Theorem 21: Let $P_{XY}$ be a joint distribution on $E_X \times E_Y$ and let $P_X$ and $P_Y$ denote its marginals. Let $y^n$ be a sequence and $X^n$ a random sequence drawn i.i.d. according to $\mu^n = \otimes_{i=1}^n P_X$. If $D(P_{XY}\|P_X \times P_Y)$ is finite then for each $\delta > 0$, there are $\epsilon(\delta)$ and $\bar\epsilon(\delta)$ such that if $\epsilon < \epsilon(\delta)$, $\bar\epsilon < \bar\epsilon(\delta)$ and $y^n \in A^n_{\bar\epsilon}(P_Y)$ for all $n$ greater than some $N$, then

$$\limsup_n \frac{1}{n} \log \mu^n((X^n, y^n) \in A^n_\epsilon(P_{XY})) \le -D(P_{XY}\|P_X \times P_Y) + \delta. \qquad (33)$$

If $D(P_{XY}\|P_X \times P_Y) = \infty$, then for every $L > 0$, there are sufficiently small $\epsilon, \bar\epsilon > 0$ such that (33) holds with the right side replaced by $-L$.

Proof: See Appendix. ■

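Theorem 22: Let $P_{XY}$ be a joint distribution on $E_X \times E_Y$ and let $P_X$ and $P_Y$ denote its marginals. Let $y^n$ be a sequence and $X^n$ a random sequence drawn i.i.d. according to $\mu^n = \otimes_{i=1}^n P_X$. Then, for each $\delta > 0$ and $\epsilon > 0$ there is an $\bar\epsilon(\epsilon, \delta) > 0$ such that if $y^n \in A^n_{\bar\epsilon(\epsilon,\delta)}(P_Y)$ for all $n$ greater than some $N$, then

$$\liminf_n \frac{1}{n} \log \mu^n((X^n, y^n) \in A^n_\epsilon(P_{XY})) \ge -D(P_{XY}\|P_X \times P_Y) - \delta. \qquad (34)$$

Proof: See Appendix. ■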

V. Examples 

We now apply the notion of weak* typical sequences to prove achievability results for two channel coding examples. The first is the traditional point-to-point channel. While more general results can be obtained using information spectrum methods, the example highlights the application of weak* typical sequences.

In the second example, we apply weak* typical sequences to Gel'fand-Pinsker channels. These results cannot be obtained for arbitrary Polish spaces using weak/strong typical sequences.

In this section, the cost constraint $g(x)$ is continuous and bounded. In Section VI we will consider the Gaussian case with power constraint.

A. Point-to-Point Channel 

We consider communicating over a channel $W_{Y|X}$, where the alphabets $E_X$ and $E_Y$ are Polish spaces. For completeness, we briefly state some definitions.

An $(n, M, P_e)$ code is a set of $M$ codewords $\mathbf{x}_1, \ldots, \mathbf{x}_M$ and a decoder $\phi : E_Y^n \to \{1, \ldots, M\}$ such that the average probability of error is

$$P_e = \frac{1}{M} \sum_{v=1}^{M} P[\phi(\mathbf{Y}) \ne v \mid \mathbf{X} = \mathbf{x}_v]. \qquad (35)$$

A rate $R$ is said to be achievable if there is a sequence of codes $(n, M_n, P_e^n)$ with block lengths $n$, $R = \lim_n \frac{1}{n} \log M_n$, and probability of error $P_e^n \to 0$.

We will show the following well known result using weak* typical sequences. 

Theorem 23: Let $W_{Y|X}$ be a communication channel with Polish input and output alphabets and input constraint $(g(x), \Gamma)$. Then any rate

$$R < \sup_{P_X :\, E_{P_X}[g(X)] < \Gamma} I(X; Y) \qquad (36)$$

is achievable.

Remark 24: The converse can be obtained with the usual Fano inequality. 

Proof: The proof follows the usual random coding argument with the exception that we now use the results derived for weak* typical sequences. Pick any $\gamma > 0$. We will bound the probability of error for a random code by $8\gamma$ for large enough $n$.

In particular, as usual, pick a $P_X$ which satisfies the constraint $E_{P_X}[g(X)] < \Gamma$. We generate $M_n = \lfloor 2^{nR} \rfloor$ codewords of length $n$ with each entry i.i.d. according to $P_X$, and denote these as $\mathbf{X}_1, \ldots, \mathbf{X}_{M_n}$.

The encoder transmits $\mathbf{X}_V$, where $V$ is uniform among the indices $\{1, \ldots, M_n\}$. The decoder employs weak* typical decoding. Specifically, it looks for an index $v$ such that $(\mathbf{X}_v, \mathbf{Y})$ are weak* $(P_X \otimes W_{Y|X}, \epsilon)$-typical for some $\epsilon > 0$ and declares $v$ as the transmitted index if such a $v$ exists and is unique. Otherwise, an error is declared.

By the usual symmetry, without loss of generality we assume that the index $v = 1$ is selected at the transmitter. The probability of error for a $(P_X \otimes W_{Y|X}, \epsilon)$-typical decoder is then bounded as

$$P_e^n \le P[g(\mathbf{X}_1) > \Gamma] + P[(\mathbf{X}_1, \mathbf{Y}) \notin A^n_\epsilon(P_X \otimes W_{Y|X})] + P[\cup_{v \ne 1} (\mathbf{X}_v, \mathbf{Y}) \in A^n_\epsilon(P_X \otimes W_{Y|X})]. \qquad (37)$$

By part 1 of Theorem 15, $P^n_{e,2} := P[(\mathbf{X}_1, \mathbf{Y}) \notin A^n_\epsilon(P_X \otimes W_{Y|X})] \to 0$ as $n \to \infty$. Thus $P^n_{e,2} < \gamma$ for all $n$ larger than some $N_2$.

Second, we note that by Theorem 19 and Remark 20, for any index $v \ne 1$ and any $n$ greater than some sufficiently large $N_3$,

$$P[(\mathbf{X}_v, \mathbf{Y}) \in A^n_\epsilon(P_X \otimes W_{Y|X})] \le 2^{-n(I(X;Y) - \delta)} \qquad (38)$$

and $\delta \to 0$ as $\epsilon \to 0$. Thus, for large enough $n$, the union bound implies

$$P^n_{e,3} := P[\cup_{v \ne 1} (\mathbf{X}_v, \mathbf{Y}) \in A^n_\epsilon(P_X \otimes W_{Y|X})] \le 2^{nR}\, 2^{-n(I(X;Y) - \delta)}, \qquad (39)$$

and $P^n_{e,3} < \gamma$ for all $n$ larger than some $N_3$ provided $R < I(X; Y) - \delta$.

Finally, one can upper bound $P^n_{e,1} := P[g(\mathbf{X}_1) > \Gamma]$ by $P[\mathbf{X}_1 \notin A^n_\alpha(P_X)] + P[g(\mathbf{X}_1) > \Gamma \mid \mathbf{X}_1 \in A^n_\alpha(P_X)]$ for any arbitrary $\alpha > 0$.

By part 2 of Theorem 15 and since $E_{P_X}[g(X)] < \Gamma$, there is a sufficiently small $\alpha > 0$ such that $P^n_{e,5} := P[g(\mathbf{X}_1) > \Gamma \mid \mathbf{X}_1 \in A^n_\alpha(P_X)] = 0$.

By part 1 of Theorem 15, for any $\alpha > 0$, $P^n_{e,4} := P[\mathbf{X}_1 \notin A^n_\alpha(P_X)]$ vanishes as $n \to \infty$. Thus $P^n_{e,4} < \gamma$ for all $n$ larger than some $N_4$.

Thus for any rate $R < I(X; Y) - \delta$, the bound in (37) is at most $8\gamma$ for all $n$ sufficiently large. Finally, $\delta > 0$ can be made arbitrarily small by choosing $\epsilon$ small enough. ■

Remark 25: Since $E_{P_X}[g(X)] < \Gamma$ and each codeletter of each codeword is i.i.d., one could have bounded $P[g(\mathbf{X}_1) > \Gamma]$ by the strong law of large numbers. However, this approach will not be possible for Gel'fand-Pinsker channels as the channel input is not generated by independent and randomly chosen codeletters.

B. Gel'fand-Pinsker Channels 

We now consider proving an achievability result for Gel'fand-Pinsker channels assuming Polish alphabets. The achievability result for $R < I(U; Y) - I(U; S)$ was proved in the discrete case in [9]. The Gaussian case with additive interference and noise was considered in [4] and further results on additive interference and noise can be found in [3], [14], [20], [21]. Here, we consider achievability for a general channel $W_{Y|SX}$ with Polish alphabets directly using weak* typical sequences.

(Footnote: Here we assume that $I(X; Y)$ is finite. The case that $I(X; Y) = \infty$ can be considered separately.)

We start with a brief set of definitions. A source sends a message $V \in \{1, \ldots, M\}$ selected uniformly at random to a receiver by transmitting a sequence $\mathbf{x}$. The channel $W_{Y|XS}$ results in an output $\mathbf{Y}$ that depends stochastically on the input $\mathbf{x}$ as well as an interference sequence $\mathbf{S}$, where $\mathbf{S}$ is an i.i.d. random sequence drawn according to $P_S$. Furthermore, the encoder is aware of the interference sequence $\mathbf{S}$ a priori and the decoder is unaware of the interference. Thus, the encoder is described by the mapping $\phi^n_X : \{1, \ldots, M\} \times E_S^n \to E_X^n$, while the decoder is the mapping $\phi^n_Y : E_Y^n \to \{1, \ldots, M\}$.

A code is a tuple $(\phi^n_X, M, \phi^n_Y, P_e)$ where $P_e = P[\phi^n_Y(\mathbf{Y}) \ne V \mid \mathbf{X} = \phi^n_X(V, \mathbf{S})]$. A rate $R$ is said to be achievable if there exists a sequence of codes $(\phi^n_X, M_n, \phi^n_Y, P^n_e)$ with $\lim_n \frac{1}{n} \log M_n = R$ and $\lim_n P^n_e = 0$.

We have the following achievability result for Polish alphabets. 

Theorem 26: For Gel'fand-Pinsker channels with cost constraint $(g(x), \Gamma)$ at the transmitter, any rate

$$R < \sup_{P_{U|S},\, W_{X|US} :\, E[g(X)] < \Gamma} I(U; Y) - I(U; S), \qquad (40)$$

is achievable, where the supremum is over all transition kernels $P_{U|S}$ and all channels $W_{X|US}$.

Remark 27: Recall that a channel is a transition kernel that satisfies a weak* continuity condition.

Proof: Again, the random coding argument is followed; however, we now invoke weak* typical sequences. Pick any $\gamma > 0$. We will show that for any $n$ larger than some $N$, the probability of error with a random codebook is at most $4\gamma$.

Specifically, pick a transition kernel $P_{U|S}$ and channel $W_{X|US}$ which satisfy the constraint $E[g(X)] < \Gamma$ and let $\delta > 0$. First, construct $M_n = \lfloor 2^{nR} \rfloor$ bins, with each bin containing $\lceil 2^{n(I(U;S) + \delta)} \rceil$ sequences of length $n$ with each codeletter generated i.i.d. according to the marginal $P_U$. We denote these sequences as $\mathbf{U}_1, \mathbf{U}_2, \ldots, \mathbf{U}_K$ where $K = \lfloor 2^{nR} \rfloor \times \lceil 2^{n(I(U;S) + \delta)} \rceil$.

Following the usual argument, to encode message $v \in \{1, \ldots, M_n\}$, the encoder looks in bin $v$ for a sequence $\mathbf{U}_j$ such that $(\mathbf{U}_j, \mathbf{S}) \in A^n_{\epsilon_1}(P_{US})$, i.e., $(\mathbf{U}_j, \mathbf{S})$ are weak* $(P_{US}, \epsilon_1)$-typical for some appropriate $\epsilon_1 > 0$.

If there is no such $\mathbf{U}_j$ sequence, an error is declared, which is denoted by the event $E_1$. Otherwise, the encoder constructs a sequence $\mathbf{X}$ generated by the memoryless channel $W_{X|US}$. If

$$\frac{1}{n} \sum_{\ell=1}^{n} g(X_\ell) > \Gamma, \qquad (41)$$

then the transmission of $\mathbf{X}$ would violate the channel input constraint and an error is declared, denoted by the event $E_2$. Otherwise, $\mathbf{X}$ is transmitted over the channel.

The receiver obtains $\mathbf{Y}$ and looks in all bins for a pair of sequences $(\mathbf{U}_j, \mathbf{Y})$ that are jointly $(P_{UY}, \epsilon_3)$-typical for some appropriate $\epsilon_3$. If there is a unique such pair, then the index of the bin in which $\mathbf{U}_j$ is present is declared as the estimate $\hat{v}$ of $v$. If the index is not unique, or the bin is incorrect, we denote this error event by $E_3$.

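The probability of error is bounded by

$$P_e \le P[E_0] + P[E_1 \cap E_0^c] + P[E_2 \mid E_1^c] + P[E_3 \mid E_1^c], \qquad (42)$$

where $E_0$ is the event that $\mathbf{S}$ is not $(P_S, \epsilon_0)$-typical for some suitable $\epsilon_0 > 0$.

Error analysis:

We start by analyzing $P[E_3 \mid E_1^c]$ and note that

$$P[E_3 \mid E_1^c] \le P[E_4 \mid E_1^c] + P[E_5 \mid E_1^c, E_4^c], \qquad (43)$$

where $E_4$ is the event that $(\mathbf{U}_i, \mathbf{Y})$ are not jointly $(P_{UY}, \epsilon_3)$-typical, with $\mathbf{U}_i$ the sequence selected by the encoder, and $E_5$ is the event that there is an index $j \ne i$ such that $(\mathbf{U}_j, \mathbf{Y})$ are jointly $(P_{UY}, \epsilon_3)$-typical.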

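By applying Theorem 18 twice, first to the channel $W_{X|US}$ and then to the channel $W_{Y|XS}$, and using Part 3 of Theorem 15, there is an $\epsilon_{US}(\epsilon_3, \gamma/2)$ such that for any $\epsilon_1 < \epsilon_{US}(\epsilon_3, \gamma/2)$, if $(\mathbf{U}_i, \mathbf{S})$ is $(P_{US}, \epsilon_1)$-typical, then $P^n_{e,4} := P[E_4 \mid E_1^c] < \gamma/2$ for all $n$ greater than some $N_4$.

Now, conditioned on $E_4^c$, by Part 3 of Theorem 15, $\mathbf{Y} = \mathbf{y}$ is $(P_Y, \epsilon_Y)$-typical, where $\epsilon_Y \to 0$ as $\epsilon_3 \to 0$. By Theorem 21, for any $\delta_3 > 0$, there are sufficiently small $\epsilon_3(\delta_3)$, $\bar\epsilon_3(\delta_3)$ and large $N_5$ such that provided $\epsilon_3 < \epsilon_3(\delta_3)$ and $\mathbf{y} \in A^n_{\bar\epsilon_3(\delta_3)}(P_Y)$ for $n > N_5$, by the usual union bound

$$P[E_5 \mid E_1^c, E_4^c] \le 2^{nR} \lceil 2^{n(I(U;S) + \delta)} \rceil\, 2^{-n(I(U;Y) - \delta_3)}. \qquad (44)$$

Thus, selecting $\epsilon_3 < \epsilon_3(\delta_3)$ and $\epsilon_3$ small enough that $\epsilon_Y < \bar\epsilon_3(\delta_3)$, $P^n_{e,5} := P[E_5 \mid E_1^c, E_4^c] < \gamma/2$ for all $n$ greater than some $N_5$ provided $R < I(U; Y) - I(U; S) - \delta_3 - \delta$. Note that $\delta_3 > 0$ and $\delta > 0$ are arbitrary.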

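We now analyze $P[E_2 \mid E_1^c]$. For any $\epsilon_2 > 0$, there is an $\epsilon_1(\epsilon_2, \gamma) > 0$ such that for $\epsilon_1 < \epsilon_1(\epsilon_2, \gamma)$, by Theorem 18,

$$P^n_{e,2} := P[(\mathbf{U}, \mathbf{S}, \mathbf{X}) \notin A^n_{\epsilon_2}(P_{USX})] < \gamma \qquad (45)$$

for all $n$ larger than some $N_2$. By Part 2 of Theorem 15, there is a small enough $\epsilon_2$ such that $P[g(\mathbf{X}) > \Gamma \mid (\mathbf{U}, \mathbf{S}, \mathbf{X}) \in A^n_{\epsilon_2}(P_{USX})] = 0$. Select $\epsilon_1 < \min\{\epsilon_{US}(\epsilon_3, \gamma/2), \epsilon_1(\epsilon_2, \gamma)\}$.

We now analyze $P[E_1 \cap E_0^c]$. Let $\delta_1 > 0$ be arbitrary. By Theorem 22, there are $N_1$ and $\epsilon_0 = \epsilon_0(\epsilon_1, \delta_1)$ such that, conditioned on $\mathbf{S} = \mathbf{s}$ being $(P_S, \epsilon_0)$-typical,

$$P[(\mathbf{U}, \mathbf{s}) \in A^n_{\epsilon_1}(P_{US})] \ge 2^{-n(I(U;S) + \delta_1)}, \qquad (46)$$

for $n > N_1$. Thus,

$$P[E_1 \cap E_0^c] = E_{P^n_S}\left[ P[E_1 \mid \mathbf{S} = \mathbf{s}]\, 1_{\{\mathbf{s} \in A^n_{\epsilon_0}(P_S)\}} \right] \qquad (47)$$

$$= E_{P^n_S}\left[ \left(1 - P[(\mathbf{U}, \mathbf{s}) \in A^n_{\epsilon_1}(P_{US})]\right)^{\lceil 2^{n(I(U;S) + \delta)} \rceil} 1_{\{\mathbf{S} \in A^n_{\epsilon_0}(P_S)\}} \right] \qquad (48)$$

$$\le E_{P^n_S}\left[ \left(1 - 2^{-n(I(U;S) + \delta_1)}\right)^{2^{n(I(U;S) + \delta)}} 1_{\{\mathbf{S} \in A^n_{\epsilon_0}(P_S)\}} \right] \qquad (49)$$

$$\le \left(1 - 2^{-n(I(U;S) + \delta_1)}\right)^{2^{n(I(U;S) + \delta)}} \qquad (50)$$

$$\le \exp\left(-2^{-n(\delta_1 - \delta)}\right), \qquad (51)$$

where $P^n_S = \otimes_{i=1}^n P_S$. Thus, selecting $\delta_1 < \delta$ yields $P^n_{e,1} := P[E_1 \cap E_0^c] < \gamma$ for all sufficiently large $n$.

Finally, we analyze $P[E_0]$. By Part 1 of Theorem 15, for any $\epsilon_0 > 0$, $P^n_{e,0} := P[\mathbf{S} \notin A^n_{\epsilon_0}(P_S)]$ vanishes as $n \to \infty$ and thus $P^n_{e,0} < \gamma$ for all $n$ greater than some $N_0$. ■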
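The step from (50) to (51) uses the elementary inequality $(1-p)^K \le e^{-Kp}$ with $p = 2^{-n(I(U;S)+\delta_1)}$ and $K = 2^{n(I(U;S)+\delta)}$. The following sketch checks it in the log domain for a few block lengths. The numerical values of $I(U;S)$, $\delta$ and $\delta_1$ are arbitrary assumptions chosen for illustration, and the function names are hypothetical.

```python
import math

def covering_failure_log(n, mutual_info, delta, delta1):
    """Natural log of (1 - p)^K with p = 2^{-n(I + delta1)} and K = 2^{n(I + delta)}."""
    p = 2.0 ** (-n * (mutual_info + delta1))
    k = 2.0 ** (n * (mutual_info + delta))
    return k * math.log1p(-p)   # log1p stays accurate for very small p

def exponential_bound_log(n, delta, delta1):
    """Natural log of exp(-2^{n(delta - delta1)}), the right side of (51)."""
    return -(2.0 ** (n * (delta - delta1)))

for n in (10, 50, 100, 200):
    print(n, covering_failure_log(n, 0.5, 0.1, 0.05), exponential_bound_log(n, 0.1, 0.05))
```

With $\delta_1 < \delta$ the right-hand column decreases doubly exponentially in $n$, which is why the encoder fails to find a typical $\mathbf{U}_j$ with vanishing probability.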

Remark 28: We note the following. First, we could not rely on the law of large numbers to argue that $\mathbf{X}$ would satisfy the constraint pair $(g(x), \Gamma)$. This is because while $\mathbf{X}$ was generated stochastically, it was generated based on the pair $(\mathbf{U}_j, \mathbf{S})$, where $\mathbf{U}_j$ was specifically chosen to satisfy a given property. Thus we instead argued via Theorem 18 that the triple $(\mathbf{U}_j, \mathbf{S}, \mathbf{X})$ is $\epsilon_2$-typical (by the channel consistency property or Markov Lemma) and then that $\mathbf{X}$ must satisfy the power constraint for sufficiently small $\epsilon_2$.

Second, we again employed the channel consistency property (or Markov Lemma) to prove that the pair $(\mathbf{U}_j, \mathbf{Y})$ are jointly typical with high probability.

VI. The Gaussian Case

In Section V, achievability results were proved for a point-to-point channel as well as the Gel'fand-Pinsker channel with input constraints. It was noted that due to the input constraint $(g(x), \Gamma)$, either the (continuous) cost function $g(x)$ should be bounded, or the input alphabet should be compact (which trivially implies $g(x)$ is bounded).

This rules out the consideration of an input constraint $(g(x) = x^2, \sigma_X^2)$ with a Gaussian input distribution, as neither the cost function nor the input alphabet is then bounded.

In this section, we show how one can recover the traditional achievability results in both cases for Gaussian distributions. Specifically, we will consider an input alphabet over the interval $E_{X_L} = [-L, L]$ and show that as $L \to \infty$, one can arbitrarily approach the well-known results in the Gaussian case. It should be noted that we consider all alphabets as subsets of $\mathbb{R}$ for simplicity of exposition only and the arguments apply equally well to alphabets over $\mathbb{R}^n$.

A. Point-to-Point Channel 

Here the capacity of the channel $Y = X + Z$ with $Z \sim \mathcal{N}(0, \sigma_Z^2)$ is well known to be $C = I(Y; X)$ evaluated for $X \sim \mathcal{N}(0, \sigma_X^2)$.

Now, consider the family of random variables $X_L$ (indexed by $L > 0$) with densities

$$f_{X_L}(x) = \begin{cases} 0, & x < -L \\ f_X(x)/K(L), & -L \le x \le L \\ 0, & x > L \end{cases} \qquad (52)$$

where $f_X(x)$ is the PDF of an $\mathcal{N}(0, \sigma_X^2)$ distribution and

$$K(L) = \int_{-L}^{L} f_X(x)\, dx \qquad (53)$$

is a normalization constant with the property that $\lim_{L \to \infty} K(L) = 1$.
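The constant $K(L)$ of (53) and the second moment of the truncated input have closed forms, which makes the behavior in $L$ easy to inspect. The sketch below is an illustration with hypothetical function names. It uses the standard expression for the variance of a symmetrically truncated Gaussian, which is an added fact not stated in the text, and it takes $\sigma_X = 1$ in the printed examples as an assumption.

```python
import math

def truncation_mass(L, sigma):
    """K(L): probability that a N(0, sigma^2) variable lies in [-L, L]."""
    return math.erf(L / (sigma * math.sqrt(2.0)))

def truncated_second_moment(L, sigma):
    """E[X_L^2] for a N(0, sigma^2) density truncated to [-L, L] and renormalized."""
    a = L / sigma
    phi_a = math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)   # standard normal pdf at a
    return sigma * sigma * (1.0 - 2.0 * a * phi_a / truncation_mass(L, sigma))

for L in (0.5, 1.0, 2.0, 4.0, 8.0):
    print(L, truncation_mass(L, 1.0), truncated_second_moment(L, 1.0))
```

The printed second moments stay strictly below $\sigma_X^2$ and $K(L)$ approaches one, so each truncated input meets the power constraint while approaching the Gaussian input.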

For notational convenience, let the output random variable be denoted by $Y_L$ when the input is $X_L$, i.e., $Y_L = X_L + Z$, and let the output random variable be $Y$ when the input is $X \sim \mathcal{N}(0, \sigma_X^2)$, i.e., $Y = X + Z$. It is straightforward to verify that $E[X_L^2] < \sigma_X^2$ for all $L$, and thus this input distribution satisfies the input constraint, and the rate

$$R_L = I(Y_L; X_L) \qquad (54)$$

is achievable for each $L$. We will show that as $L \to \infty$, $R_L = I(Y_L; X_L) \to I(X; Y)$.

First, note that

$$I(Y_L; X_L) = h(Y_L) - h(Y_L \mid X_L) \qquad (55)$$

$$= h(Y_L) - h(Z), \qquad (56)$$

and thus it suffices to show that $\lim_{L \to \infty} h(Y_L) = h(Y)$.

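(Footnote: While $X \sim \mathcal{N}(0, \sigma_X^2)$ does not satisfy the input constraint, $X \sim \mathcal{N}(0, \sigma_X^2 - \epsilon)$, $0 < \epsilon < \sigma_X^2$, does, and the capacity follows by a simple limit argument.)

Second,

$$f_{Y_L}(y) = \int_{-\infty}^{\infty} f_{Y|X}(y|x) f_{X_L}(x)\, dx \qquad (57)$$

$$= \frac{1}{K(L)} \int_{-L}^{L} f_{Y|X}(y|x) f_X(x)\, dx. \qquad (58)$$

Define

$$g_{Y_L}(y) := \int_{-L}^{L} f_{Y|X}(y|x) f_X(x)\, dx, \qquad (59)$$

then $f_{Y_L}(y) = g_{Y_L}(y)/K(L)$, and

$$h(Y_L) = \int_{-\infty}^{\infty} f_{Y_L}(y) \log \frac{1}{f_{Y_L}(y)}\, dy \qquad (60)$$

$$= \frac{1}{K(L)} \int_{-\infty}^{\infty} g_{Y_L}(y) \log \frac{1}{g_{Y_L}(y)}\, dy + \int_{-\infty}^{\infty} f_{Y_L}(y) \log K(L)\, dy \qquad (61)$$

$$= -\frac{1}{K(L)} \left[ \int_{-\infty}^{\infty} g_{Y_L}(y) \log g_{Y_L}(y)\, dy \right] + \log K(L). \qquad (62)$$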

Since $\lim_{L \to \infty} K(L) = 1$, it remains only to show that the term in the square brackets converges to $-h(Y)$ as $L \to \infty$. However, because of (59), $g_{Y_L}(y)$ is continuous, strictly positive, strictly increasing in $L$ and converges pointwise to $f_Y(y)$. Thus, the integrand $g_{Y_L}(y) \log g_{Y_L}(y)$ is continuous and converges pointwise to $f_Y(y) \log f_Y(y)$ as $L \to \infty$.

Let $A = \{y \in \mathbb{R} \mid f_Y(y) < e^{-1}\}$. Since $x \log x$ is decreasing in $x$ for $0 < x < e^{-1}$, for $y \in A$, $g_{Y_L}(y) \log g_{Y_L}(y)$ is decreasing in $L$ and

$$\lim_{L \to \infty} \int_A g_{Y_L}(y) \log g_{Y_L}(y)\, dy = \int_A f_Y(y) \log f_Y(y)\, dy \qquad (63)$$

by the (Lebesgue) monotone convergence theorem.

Now consider the set $B = A^c = \{y \in \mathbb{R} \mid f_Y(y) \ge e^{-1}\}$ and note that $B$ is closed (since $f_Y(y)$ is continuous) and bounded (since $f_Y(y)$ is a Gaussian pdf) and thus $B$ is compact. Thus, on the set $B$, $g_{Y_L}(y)$ converges uniformly to $f_Y(y)$ by Dini's theorem. Hence, there is a large enough $\bar{L}$ such that for all $L > \bar{L}$, $g_{Y_L}(y) > e^{-2}$ for all $y \in B$. Let $K = \sup_{y \in \mathbb{R}} f_Y(y)$. Then for $y \in B$ and $L > \bar{L}$, $|g_{Y_L}(y) \log g_{Y_L}(y)| \le f_Y(y) \max\{2, |\log K|\}$, which is integrable. Thus, by the dominated convergence theorem,

$$\lim_{L \to \infty} \int_B g_{Y_L}(y) \log g_{Y_L}(y)\, dy = \int_B f_Y(y) \log f_Y(y)\, dy \qquad (64)$$

and therefore

$$\lim_{L \to \infty} \int_{-\infty}^{\infty} g_{Y_L}(y) \log g_{Y_L}(y)\, dy = \lim_{L \to \infty} \int_A g_{Y_L}(y) \log g_{Y_L}(y)\, dy + \lim_{L \to \infty} \int_B g_{Y_L}(y) \log g_{Y_L}(y)\, dy \qquad (65)$$

$$= \int_{-\infty}^{\infty} f_Y(y) \log f_Y(y)\, dy \qquad (66)$$

$$= -h(Y), \qquad (67)$$

as desired.
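The convergence $h(Y_L) \to h(Y)$ can also be observed numerically. The sketch below is an illustration under stated assumptions, not part of the argument: $\sigma_X^2 = \sigma_Z^2 = 1$, entropies in nats, and a plain trapezoidal rule on $[-12, 12]$. It uses the Gaussian identity $f_Z(y - x) f_X(x) = f_Y(y)\, \mathcal{N}(x; m(y), s^2)$ with $m(y) = y\sigma_X^2/(\sigma_X^2 + \sigma_Z^2)$ and $s^2 = \sigma_X^2\sigma_Z^2/(\sigma_X^2 + \sigma_Z^2)$ to evaluate $g_{Y_L}$ in closed form, which is an added step not spelled out in the text. Function names are hypothetical.

```python
import math

def normal_pdf(y, var):
    return math.exp(-y * y / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def output_density(y, L, var_x, var_z):
    """Density of Y_L = X_L + Z, with X_L a N(0, var_x) variable truncated to [-L, L]."""
    var_y = var_x + var_z
    post_mean = y * var_x / var_y                 # E[X | Y = y] for the untruncated X
    post_std = math.sqrt(var_x * var_z / var_y)

    def cdf(t):
        return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

    # g_{Y_L}(y) = f_Y(y) * P(|X| <= L | Y = y) for the untruncated pair (X, Y).
    g = normal_pdf(y, var_y) * (cdf((L - post_mean) / post_std) - cdf((-L - post_mean) / post_std))
    k = math.erf(L / math.sqrt(2.0 * var_x))      # normalization constant K(L)
    return g / k

def entropy_nats(L, var_x=1.0, var_z=1.0, half_width=12.0, step=0.01):
    """Trapezoidal estimate of the differential entropy h(Y_L) in nats."""
    n = int(round(2 * half_width / step))
    total = 0.0
    for i in range(n + 1):
        y = -half_width + i * step
        f = output_density(y, L, var_x, var_z)
        term = -f * math.log(f) if f > 0.0 else 0.0   # convention 0 log 0 = 0
        weight = 0.5 if i in (0, n) else 1.0
        total += weight * term * step
    return total

h_gauss = 0.5 * math.log(2.0 * math.pi * math.e * 2.0)   # h(Y) for var_x + var_z = 2
for L in (1.0, 3.0, 6.0):
    print(L, entropy_nats(L), h_gauss)
```

The gap between the two printed entropies shrinks as $L$ grows, in line with the limit established in (63) to (67).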

B. Gel'fand-Pinsker Channels 

In the Gaussian case, it is well known that the capacity is obtained with the choice

$$U = X + \alpha S \qquad (68)$$

$$Y = X + Z + S, \qquad (69)$$

where $X \sim \mathcal{N}(0, \sigma_X^2)$ is independent of $S$, and $\alpha$ is an appropriately chosen constant.

We follow the same strategy as in Section VI-A. Namely, we consider the family of truncated Gaussians $X_L$ given in (52), and obtain

$$U_L = X_L + \alpha S \qquad (70)$$

$$Y_L = X_L + Z + S. \qquad (71)$$

As previously, we will show that

$$\lim_{L \to \infty} I(U_L; S) = I(U; S) \qquad (72)$$

$$\lim_{L \to \infty} I(U_L; Y_L) = I(U; Y). \qquad (73)$$

First, note that $I(U_L; S) = h(U_L) - h(X_L)$. Second, $\lim_{L \to \infty} h(U_L) = h(U)$ follows by the same argument as in Section VI-A with $Y_L$ replaced by $U_L$ and $Z$ replaced by $\alpha S$. Third, $\lim_{L \to \infty} h(X_L) = h(X)$ since

$$\lim_{L \to \infty} \int_{-L}^{L} f_X(x) \log f_X(x)\, dx = \int_{-\infty}^{\infty} f_X(x) \log f_X(x)\, dx. \qquad (74)$$

Next, we note that $I(U_L; Y_L) = h(U_L) + h(Y_L) - h(U_L, Y_L)$, that $\lim_{L \to \infty} h(U_L) = h(U)$ was already argued, and that $\lim_{L \to \infty} h(Y_L) = h(Y)$ follows similarly. We provide an outline of $\lim_{L \to \infty} h(U_L, Y_L) = h(U, Y)$. Because of the additive nature of (68)-(71), if $f_{UY|X}(u, y|x)$ denotes the conditional PDF of $U$ and $Y$ given $X$, then

$$f_{U_L, Y_L}(u, y) = \frac{1}{K(L)} \int_{-L}^{L} f_{UY|X}(u, y|x) f_X(x)\, dx. \qquad (75)$$

Thus, following the same argument as in Section VI-A, one can apply the monotone and dominated convergence theorems and obtain $\lim_{L \to \infty} h(U_L, Y_L) = h(U, Y)$.

VII. Conclusion 

In this paper, a notion of typical sequences based on the weak* topology was defined. This notion of typical sequence applies to discrete, continuous and mixed distributions and was shown to satisfy consistency properties normally associated with strongly typical sequences. As examples of applying this notion of typical sequences, achievable rates were proved for the traditional point-to-point channel and Gel'fand-Pinsker channels with Polish alphabets and input constraints.

Appendix 

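Proof of Theorem 21: Pick two fields $\mathcal{F}_X$ and $\mathcal{F}_Y$ as described in Corollary 8 and let $\mathcal{Q}_X \subset \mathcal{F}_X$ and $\mathcal{Q}_Y \subset \mathcal{F}_Y$ be two partitions of size $|\mathcal{Q}_X| = N_X$ and $|\mathcal{Q}_Y| = N_Y$ with elements $\mathcal{Q}_X = \{A_1, \ldots, A_{N_X}\}$, $\mathcal{Q}_Y = \{B_1, \ldots, B_{N_Y}\}$.

Let $P^n_{XY}$ be the empirical measure induced by the pair of sequences $(X^n, y^n)$. Let $Q^n_{XY}$, $Q^n_X$ and $Q^n_Y$ denote the empirical measures and empirical marginals induced on the partitions $\mathcal{Q}_X \times \mathcal{Q}_Y$. Furthermore, for $B_j$ such that $Q^n_Y(B_j) > 0$, define the conditional measure $Q^n_{X|Y}(A_i|B_j) = Q^n_{XY}(A_i \times B_j)/Q^n_Y(B_j)$; otherwise $Q^n_{X|Y}(A_i|B_j)$ is arbitrary.

Likewise, starting with $P_{XY}$, let $Q_{XY}$, $Q_X$ and $Q_Y$ denote the induced measures and marginals on the partitions $\mathcal{Q}_X \times \mathcal{Q}_Y$. For $B_j$ such that $Q_Y(B_j) > 0$, define $Q_{X|Y}(A_i|B_j) = Q_{XY}(A_i \times B_j)/Q_Y(B_j)$, and note that $Q_{X|Y}(\cdot|B_j) \ll Q_X(\cdot)$. Otherwise pick $Q_{X|Y}(A_i|B_j)$ to be some arbitrary distribution for which $Q_{X|Y}(\cdot|B_j) \ll Q_X(\cdot)$.

Now, pick an $\epsilon_1 > 0$ and note that $(X^n, y^n) \in A^n_\epsilon(P_{XY})$ implies $P^n_{XY} \in B(P_{XY}, \epsilon)$, which for sufficiently small $\epsilon$ itself implies that for all $i$ and $j$,

$$|Q^n_{XY}(A_i \times B_j) - Q_{XY}(A_i \times B_j)| < \epsilon_1. \qquad (76)$$

Therefore, with this choice of $\epsilon$,

$$\mu^n((X^n, y^n) \in A^n_\epsilon(P_{XY})) \le \mu^n\left( \bigcap_{i,j} \{|Q^n_{XY}(A_i \times B_j) - Q_{XY}(A_i \times B_j)| < \epsilon_1\} \right). \qquad (77)$$

Let $\mathcal{J}_{>0}$ denote the set of $j$ such that $Q_Y(B_j) > 0$ and select $\bar\epsilon(\delta) > 0$ small enough such that $y^n \in A^n_{\bar\epsilon}(P_Y)$ implies

$$|Q^n_Y(B_j) - Q_Y(B_j)| < \epsilon_1 \quad \forall j \qquad (78)$$

$$\left| \frac{Q^n_Y(B_j)}{Q_Y(B_j)} - 1 \right| < \epsilon_1 \quad \forall j \in \mathcal{J}_{>0}. \qquad (79)$$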

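Now, by (78), for all $j \notin \mathcal{J}_{>0}$ and all $i$, (76) is satisfied and the right side of the bound in (77) can be limited to the intersection over all $i$ and all $j \in \mathcal{J}_{>0}$. Furthermore, since for $j \in \mathcal{J}_{>0}$, (76) implies

$$|Q^n_{X|Y}(A_i|B_j) Q^n_Y(B_j) - Q_{X|Y}(A_i|B_j) Q^n_Y(B_j) + Q_{X|Y}(A_i|B_j) Q^n_Y(B_j) - Q_{X|Y}(A_i|B_j) Q_Y(B_j)| < \epsilon_1, \qquad (80)$$

together with (78), this implies

$$|Q^n_{X|Y}(A_i|B_j) - Q_{X|Y}(A_i|B_j)|\, Q^n_Y(B_j) < 2\epsilon_1, \qquad (81)$$

and for $j \in \mathcal{J}_{>0}$, with (79) this implies

$$|Q^n_{X|Y}(A_i|B_j) - Q_{X|Y}(A_i|B_j)| < \frac{2\epsilon_1}{Q^n_Y(B_j)} < \frac{2\epsilon_1}{Q_Y(B_j)(1 - \epsilon_1)}. \qquad (82)$$

Let $\epsilon_2 = \max_{j \in \mathcal{J}_{>0}} \frac{2\epsilon_1}{Q_Y(B_j)(1 - \epsilon_1)}$ and note that $\epsilon_2 \to 0$ as $\epsilon_1 \to 0$ and for all $i$ and $j \in \mathcal{J}_{>0}$,

$$|Q^n_{X|Y}(A_i|B_j) - Q_{X|Y}(A_i|B_j)| < \epsilon_2. \qquad (83)$$

Therefore, we have shown that

$$\mu^n((X^n, y^n) \in A^n_\epsilon(P_{XY})) \le \mu^n\left( \bigcap_{i,\, j \in \mathcal{J}_{>0}} \{|Q^n_{X|Y}(A_i|B_j) - Q_{X|Y}(A_i|B_j)| < \epsilon_2\} \right) \qquad (84)$$

$$= \prod_{j \in \mathcal{J}_{>0}} \mu^n\left( \bigcap_i \{|Q^n_{X|Y}(A_i|B_j) - Q_{X|Y}(A_i|B_j)| < \epsilon_2\} \right). \qquad (85)$$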

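Now, let $N_{n,j} = \sum_{\ell=1}^{n} 1_{\{y_\ell \in B_j\}}$ for any $j \in \mathcal{J}_{>0}$, i.e., the number of letters of $y^n$ that are in $B_j$. Then for a given $j \in \mathcal{J}_{>0}$,

$$\limsup_n \frac{1}{n} \log \mu^n\left( \bigcap_i \{|Q^n_{X|Y}(A_i|B_j) - Q_{X|Y}(A_i|B_j)| < \epsilon_2\} \right)$$

$$= \limsup_n \frac{N_{n,j}}{n} \times \limsup_n \frac{1}{N_{n,j}} \log \mu^n\left( \bigcap_i \{|Q^n_{X|Y}(A_i|B_j) - Q_{X|Y}(A_i|B_j)| < \epsilon_2\} \right) \qquad (86)$$

$$\le -(1 - \epsilon_1) Q_Y(B_j) \times \left[ D(Q_{X|Y}(\cdot|B_j)\|Q_X(\cdot)) - \delta_{1,j} \right], \qquad (87)$$

where $\delta_{1,j} \to 0$ as $\epsilon_2 \to 0$ since $Q_{X|Y}(\cdot|B_j) \ll Q_X(\cdot)$ and we have used Theorem 2.1.10 of [6]. Therefore,

$$\limsup_{n \to \infty} \frac{1}{n} \log \mu^n((X^n, y^n) \in A^n_\epsilon(P_{XY})) \le -(1 - \epsilon_1) D(Q_{XY}\|Q_X \times Q_Y) + \delta_2 \qquad (88)$$

$$= -(1 - \epsilon_1) H_{P_{XY}\|P_X \times P_Y}(\mathcal{Q}_X \times \mathcal{Q}_Y) + \delta_2, \qquad (89)$$

where $\delta_2 = (1 - \epsilon_1) \sum_{j \in \mathcal{J}_{>0}} \delta_{1,j}$.

If $D(P_{XY}\|P_X \times P_Y)$ is finite, the result then follows by first choosing appropriately fine quantizers $\mathcal{Q}_X$ and $\mathcal{Q}_Y$ such that

$$-H_{P_{XY}\|P_X \times P_Y}(\mathcal{Q}_X \times \mathcal{Q}_Y) \le -D(P_{XY}\|P_X \times P_Y) + \delta/2, \qquad (90)$$

and then choosing $\epsilon_1$ small enough (thus $\epsilon(\delta)$ and $\bar\epsilon(\delta)$ small enough) such that

$$-(1 - \epsilon_1) H_{P_{XY}\|P_X \times P_Y}(\mathcal{Q}_X \times \mathcal{Q}_Y) + \delta_2 \le -D(P_{XY}\|P_X \times P_Y) + \delta. \qquad (91)$$

If $D(P_{XY}\|P_X \times P_Y) = \infty$ then for every $L > 0$, we can find a pair of quantizers such that $H_{P_{XY}\|P_X \times P_Y}(\mathcal{Q}_X \times \mathcal{Q}_Y) > 2L$. The result follows by choosing $\epsilon_1$ small enough (thus $\epsilon$ and $\bar\epsilon$ small enough) such that $(1 - \epsilon_1) H_{P_{XY}\|P_X \times P_Y}(\mathcal{Q}_X \times \mathcal{Q}_Y) - \delta_2 > L$. Thus the right side of (33) is less than $-L$ for any positive $L$. ■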

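Proof of Theorem 22: The case that $D(P_{XY}\|P_X \times P_Y) = \infty$ is trivial, thus we only consider finite $D(P_{XY}\|P_X \times P_Y)$.

We first show that for each $\epsilon > 0$, there are finite partitions $\mathcal{Q}_X = \{A_i\}$ and $\mathcal{Q}_Y = \{B_j\}$ of $E_X$ and $E_Y$, and $\lambda(\epsilon)$ such that

$$\{P^n_{XY} \in B(P_{XY}, \epsilon)\} \supset \bigcap_{i,j} \{|P^n_{XY}(A_i \times B_j) - P_{XY}(A_i \times B_j)| < \lambda(\epsilon)\} \qquad (92)$$

and $\lambda(\epsilon) \to 0$ as $\epsilon \to 0$.

To see this, let $\mathcal{F}_X$ and $\mathcal{F}_Y$ be fields as described in Corollary 8. For $k = 1, 2, \ldots$, let $\mathcal{Q}^k_X \subset \mathcal{F}_X$ be a sequence of successively finer finite partitions of $E_X$ in the sense that if $A \in \mathcal{Q}^k_X$ then $A$ is the union of atoms of $\mathcal{Q}^{k+1}_X$, and for each $A \in \mathcal{F}_X$, $A$ is the union of atoms of $\mathcal{Q}^k_X$ for some $k$. Likewise for $\mathcal{Q}^k_Y \subset \mathcal{F}_Y$. We denote the atoms of $\mathcal{Q}^k_X$ by $A^k_i$, $i = 1, \ldots, |\mathcal{Q}^k_X| := N^k_X$, and likewise for the atoms $B^k_j$ of $\mathcal{Q}^k_Y$.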

Consider any sequence $M^k_{XY}$ of distributions that satisfy the sequence of events

$$\bigcap_{i,j} \{|M^k_{XY}(A^k_i \times B^k_j) - P_{XY}(A^k_i \times B^k_j)| < \lambda_k\} \qquad (93)$$

where $\lambda_k := 1/(k \times N^k_X \times N^k_Y)$. Then for any $A \in \mathcal{F}_X$ and $B \in \mathcal{F}_Y$, $M^k_{XY}(A \times B) \to P_{XY}(A \times B)$. Thus by [1, Chapter 1, Theorem 2.2], the sequence of events in (93) implies $P_{XY} = \text{w-}\lim_k M^k_{XY}$. This implies $\lim_k d(M^k_{XY}, P_{XY}) = 0$. Let $\epsilon_k = \sup d(M^k_{XY}, P_{XY})$, where the supremum is over $M^k_{XY}$ such that (93) holds at the $k$th step. We must have $\epsilon_k \to 0$ or there would be a choice of $M^k_{XY}$ satisfying (93) such that $P_{XY} = \text{w-}\lim_k M^k_{XY}$ does not hold. Let $K$ be such that $\epsilon_K < \epsilon$.

Hence, we can pick $\mathcal{Q}_X = \mathcal{Q}^K_X$, $\mathcal{Q}_Y = \mathcal{Q}^K_Y$ and any $\lambda(\epsilon) < \lambda_K$. Therefore, with these choices,

$$\liminf_{n \to \infty} \frac{1}{n} \log \mu^n((X^n, y^n) \in A^n_\epsilon(P_{XY})) \ge \liminf_{n \to \infty} \frac{1}{n} \log \mu^n\left( \bigcap_{i,j} \{|P^n_{XY}(A_i \times B_j) - P_{XY}(A_i \times B_j)| < \lambda(\epsilon)\} \right) \qquad (94)$$

$$\overset{(a)}{\ge} -D(P_{XY}\|P_X \times P_Y) - \delta, \qquad (95)$$

where inequality (a) is justified below.

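To justify inequality (a), first let $Q_{XY}$, $Q^n_{XY}$, etc. denote the appropriate induced distributions on the partitions $\mathcal{Q}_X$ and $\mathcal{Q}_Y$ as in the proof of Theorem 21. Then, for any $\alpha > 0$,

$$|Q^n_{X|Y}(A|B) - Q_{X|Y}(A|B)| < \alpha \qquad (96)$$

implies

$$|Q^n_{XY}(A \times B) - Q_{XY}(A \times B)| < \alpha Q_Y(B) + |Q^n_Y(B) - Q_Y(B)|. \qquad (97)$$

If $\bar\epsilon(\epsilon, \delta)$ is sufficiently small that $y^n \in A^n_{\bar\epsilon(\epsilon,\delta)}(P_Y)$ implies $|Q^n_Y(B) - Q_Y(B)| < \alpha$ for all $B \in \mathcal{Q}_Y$ with $\alpha < \lambda(\epsilon)/2$, then the event

$$\bigcap_{i,j} \{|Q^n_{X|Y}(A_i|B_j) - Q_{X|Y}(A_i|B_j)| < \alpha\} \qquad (98)$$

is a subset of

$$\bigcap_{i,j} \{|P^n_{XY}(A_i \times B_j) - P_{XY}(A_i \times B_j)| < \lambda(\epsilon)\}. \qquad (99)$$

Thus

$$\liminf_n \frac{1}{n} \log \mu^n\left( \bigcap_{i,j} \{|P^n_{XY}(A_i \times B_j) - P_{XY}(A_i \times B_j)| < \lambda(\epsilon)\} \right) \ge \liminf_n \frac{1}{n} \log \mu^n\left( \bigcap_{i,j} \{|Q^n_{X|Y}(A_i|B_j) - Q_{X|Y}(A_i|B_j)| < \alpha\} \right) \qquad (100)$$

$$\ge -\sum_j (Q_Y(B_j) + \alpha) D(Q_{X|Y}(\cdot|B_j)\|Q_X(\cdot)) \qquad (101)$$

$$= -D(Q_{XY}\|Q_X \times Q_Y) - \alpha \sum_j D(Q_{X|Y}(\cdot|B_j)\|Q_X(\cdot)) \qquad (102)$$

$$\ge -D(P_{XY}\|P_X \times P_Y) - \delta, \qquad (103)$$

where $\delta = \alpha \sum_j D(Q_{X|Y}(\cdot|B_j)\|Q_X(\cdot))$ is finite as $D(P_{XY}\|P_X \times P_Y) < \infty$. Furthermore, $\delta$ can be made arbitrarily small by choosing $\alpha$ small enough, which can be assured by choosing $\bar\epsilon(\epsilon, \delta)$ small enough.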